dig beats the published bar — fully local
Retrieval recall through dig's real scan → store → index pipeline, on CPU with small open embedding models. No reranker, no LLM in the loop.
LongMemEval-S
all-MiniLM-L6-v2 (Q8) · Full official set: 500 questions, 19,829 sessions
| Metric | FTS | Vector | Hybrid |
|---|---|---|---|
| recall@5 | 60.9% | 93.3% | 93.9% |
| recall@10 | 62.5% | 97.0% | 97.1% |
| hit@5 | 66.8% | 97.8% | 98.0% |
| hit@10 | 67.2% | 98.8% | 98.8% |
| ndcg@10 | 61.1% | 90.4% | 92.1% |
| mrr | 64.1% | 90.5% | 92.8% |
Hybrid hit@5 of 98.0% clears MemPalace's published 96.6% recall_any@5 on the same model class, fully local. Chat sessions paraphrase heavily, which is exactly the gap the vector index closes over FTS (+31.2pts hit@5, hybrid vs. FTS).
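The comparison above depends on hit@5 and recall_any@5 measuring the same thing, while recall@5 is stricter. A minimal sketch of the difference, assuming the usual IR definitions (the function names and sample IDs are illustrative, not dig's internals):

```python
def recall_at_k(ranked_ids, evidence_ids, k):
    """Fraction of the evidence items that appear in the top k results."""
    evidence = set(evidence_ids)
    return len(set(ranked_ids[:k]) & evidence) / len(evidence)


def hit_at_k(ranked_ids, evidence_ids, k):
    """1.0 if at least one evidence item appears in the top k, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(evidence_ids) else 0.0


# Hypothetical question with two evidence sessions; only one is in the top 5.
ranked = ["s7", "s2", "s9", "s4", "s1", "s3"]
evidence = ["s2", "s3"]

print(recall_at_k(ranked, evidence, 5))  # 0.5
print(hit_at_k(ranked, evidence, 5))     # 1.0
```

A question with several evidence sessions can score a full hit while recall stays partial, which is why hit@k is never lower than recall@k in the tables.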
LoCoMo
nomic-embed-text-v1.5 (Q8) · 1,536 questions, multi-session conversations
| Metric | FTS | Vector | Hybrid |
|---|---|---|---|
| recall@5 | 80.4% | 78.1% | 85.3% |
| recall@10 | 86.0% | 88.2% | 93.2% |
| hit@5 | 87.1% | 83.7% | 91.3% |
| hit@10 | 91.3% | 92.3% | 97.1% |
| ndcg@10 | 75.6% | 71.6% | 78.9% |
| mrr | 75.4% | 68.9% | 76.8% |
Hybrid beats the deterministic FTS baseline on every metric from rank 3 up (+4.9pts recall@5, +5.8pts hit@10). FTS stays the zero-dependency default, available per query.
BEAM (128K tier)
all-MiniLM-L6-v2 (Q8) · 355 questions, 5,732 turns (turn-level evidence)
| Metric | FTS | Vector | Hybrid |
|---|---|---|---|
| recall@5 | 31.3% | 36.0% | 34.8% |
| recall@10 | 39.4% | 46.1% | 43.4% |
| hit@5 | 49.0% | 54.6% | 51.8% |
| hit@10 | 58.3% | 64.8% | 62.3% |
| ndcg@10 | 29.9% | 35.4% | 34.4% |
| mrr | 35.0% | 41.2% | 40.9% |
The unsaturated frontier (ICLR 2026): one long conversation, near-duplicate turns, evidence annotated per turn. Semantic retrieval beats FTS on every metric. Here vector edges out hybrid, because weak lexical rankings dilute RRF. Results for larger tiers are still being backfilled.
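The dilution effect can be reproduced with a toy Reciprocal Rank Fusion. This is a sketch only: the `rrf` helper, the turn IDs, and the constant `k=60` (the value conventionally used with RRF) are assumptions, not dig's actual implementation or settings.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical case: "t3" is the evidence turn. The vector index ranks it
# first, but the lexical ranking misses it and prefers near-duplicate turns.
vector = ["t3", "t1", "t2", "t4"]
fts = ["t1", "t2", "t4"]

fused = rrf([vector, fts])
print(fused)  # ['t1', 't2', 't4', 't3']
```

Turns that appear in both lists collect two contributions, so the evidence turn that only the vector index found slides from rank 1 to rank 4 after fusion.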
Published numbers from other systems
Shown for context. Retrieval recall and LLM-judged QA accuracy are different measurements — recall is structurally higher, since finding the evidence is easier than answering from it. dig reports retrieval metrics because dig is the retrieval/memory layer; answering belongs to the agent driving it.
Same metric as dig's hit@5 — its default-embeddings number; with its own structure enabled it drops to 89.4% (rooms) / 84.2% (compressed).
Different metric — end-to-end answer quality, not retrieval recall. Structurally lower than recall; not directly comparable.
Different metric — answer quality at ~6.9k tokens per query.
Method
Each benchmark is mapped onto a real dig knowledge base (one file per session or turn), indexed with the production FTS + vector pipeline, and scored per question against its evidence with standard IR metrics (recall@k, hit@k, NDCG@10, MRR). FTS is the deterministic default; vector and hybrid (Reciprocal Rank Fusion) opt in via a [retrieval] policy. Full methodology and reproduction commands live in docs/evals.md.
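The two rank-sensitive metrics named above can be sketched as follows, assuming binary relevance (an item either is evidence or is not) and the standard log2 discount. The helper names and sample data are hypothetical; the scoring code referenced in docs/evals.md may differ in detail.

```python
import math


def reciprocal_rank(ranked_ids, evidence_ids):
    """1 / rank of the first evidence item, or 0.0 if none is retrieved."""
    evidence = set(evidence_ids)
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in evidence:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, evidence_ids, k=10):
    """Binary-relevance NDCG: DCG of the top k divided by the ideal DCG."""
    evidence = set(evidence_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in evidence
    )
    ideal_hits = min(len(evidence), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Two hypothetical questions: (ranked results, evidence IDs).
questions = [
    (["s7", "s2", "s9", "s4", "s1", "s3"], ["s2", "s3"]),
    (["a", "b"], ["a"]),
]

mrr = sum(reciprocal_rank(r, e) for r, e in questions) / len(questions)
ndcg_first = ndcg_at_k(*questions[0])

print(mrr)                   # 0.75
print(round(ndcg_first, 3))  # 0.605
```

MRR only rewards the first evidence item found, while NDCG@10 also credits every further evidence item with a discount that grows with rank.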