dig beats the published bar — fully local

Retrieval recall through dig's real scan → store → index pipeline, on CPU with small open embedding models. No reranker, no LLM in the loop.

98.0% hit@5 on LongMemEval-S (full 500 questions), vs. MemPalace's published 96.6% on the same benchmark and model class

LongMemEval-S

all-MiniLM-L6-v2 (Q8)

Full official set — 500 questions, 19,829 sessions

Metric      FTS     Vector  Hybrid
recall@5    60.9%   93.3%   93.9%
recall@10   62.5%   97.0%   97.1%
hit@5       66.8%   97.8%   98.0%
hit@10      67.2%   98.8%   98.8%
ndcg@10     61.1%   90.4%   92.1%
mrr         64.1%   90.5%   92.8%

Hybrid hit@5 of 98.0% clears MemPalace's published 96.6% recall_any@5 on the same model class, fully local. Chat sessions paraphrase heavily, which is exactly the recall gap the vector index closes over FTS (hit@5: +31.0 pts for vector alone, +31.2 pts for hybrid).

LoCoMo

nomic-embed-text-v1.5 (Q8)

1,536 questions, multi-session conversations

Metric      FTS     Vector  Hybrid
recall@5    80.4%   78.1%   85.3%
recall@10   86.0%   88.2%   93.2%
hit@5       87.1%   83.7%   91.3%
hit@10      91.3%   92.3%   97.1%
ndcg@10     75.6%   71.6%   78.9%
mrr         75.4%   68.9%   76.8%

Hybrid beats the deterministic FTS baseline on every metric from rank 3 up (+4.9pts recall@5, +5.8pts hit@10). FTS stays the zero-dependency default, available per query.

BEAM (128K tier)

all-MiniLM-L6-v2 (Q8)

355 questions, 5,732 turns — turn-level evidence

Metric      FTS     Vector  Hybrid
recall@5    31.3%   36.0%   34.8%
recall@10   39.4%   46.1%   43.4%
hit@5       49.0%   54.6%   51.8%
hit@10      58.3%   64.8%   62.3%
ndcg@10     29.9%   35.4%   34.4%
mrr         35.0%   41.2%   40.9%

The unsaturated frontier (ICLR 2026): one long conversation, near-duplicate turns, evidence annotated per turn. Vector beats FTS on every metric, and here vector also edges out hybrid, because weak lexical rankings dilute RRF. Results for larger tiers are still being backfilled.
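To make the dilution effect concrete, here is a minimal sketch of Reciprocal Rank Fusion on a toy example. It is not dig's implementation: the function name `rrf_fuse`, the turn ids, and the constant `k=60` (a commonly used default for RRF) are all assumptions for illustration, and the value dig uses is not stated here.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids with Reciprocal Rank Fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical turn ids. The vector list puts the evidence turn "t7" first;
# the lexical list is dominated by near-duplicate turns and misses it.
vector_ranking = ["t7", "t2", "t9"]
fts_ranking = ["t2", "t9", "t4"]

fused = rrf_fuse([vector_ranking, fts_ranking])
print(fused)
```

In this toy case "t7" is ranked first by the vector list but falls to third after fusion, because "t2" and "t9" each collect a contribution from both lists. A lexical ranking that rarely surfaces the evidence can lower the fused result in the same way.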

Published numbers from other systems

Shown for context. Retrieval recall and LLM-judged QA accuracy are different measurements — recall is structurally higher, since finding the evidence is easier than answering from it. dig reports retrieval metrics because dig is the retrieval/memory layer; answering belongs to the agent driving it.

MemPalace: 96.6% recall_any@5 · LongMemEval

Same metric as dig's hit@5 — its default-embeddings number; with its own structure enabled it drops to 89.4% (rooms) / 84.2% (compressed).

mem0: 94.4% QA accuracy (LLM judge) · LongMemEval

Different metric — end-to-end answer quality, not retrieval recall. Structurally lower than recall; not directly comparable.

mem0: 92.5% QA accuracy (LLM judge) · LoCoMo

Different metric — answer quality at ~6.9k tokens per query.

Method

Each benchmark is mapped onto a real dig knowledge base (one file per session or turn), indexed with the production FTS + vector pipeline, and scored per question against its evidence with standard IR metrics (recall@k, hit@k, NDCG@10, MRR). FTS is the deterministic default; vector and hybrid (Reciprocal Rank Fusion) opt in via a [retrieval] policy. Full methodology and reproduction commands live in docs/evals.md.
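The per-question metrics named above can be sketched as follows. This is an illustration under stated assumptions, not dig's scoring code: it assumes binary relevance (an item either is evidence or is not) and a non-empty evidence set, and the function names and session ids are hypothetical. The authoritative definitions are the ones in docs/evals.md.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the evidence ids that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def hit_at_k(ranked, relevant, k):
    """1.0 if at least one evidence id appears in the top k, else 0.0."""
    return 1.0 if set(ranked[:k]) & relevant else 0.0

def mrr(ranked, relevant):
    """Reciprocal rank of the first evidence id, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0

# Hypothetical question with two evidence sessions, one of them retrieved.
ranked = ["s3", "s8", "s1", "s5", "s2"]
relevant = {"s1", "s9"}

r5 = recall_at_k(ranked, relevant, 5)
h5 = hit_at_k(ranked, relevant, 5)
rr = mrr(ranked, relevant)
n10 = ndcg_at_k(ranked, relevant, 10)
```

For this question recall@5 is 0.5 while hit@5 is 1.0, which is why hit@k is never below recall@k in the tables: a single retrieved evidence item counts as a full hit but only as partial recall. A benchmark score would then be the mean of these per-question values, assuming simple averaging over questions.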