dig beats the published bar — fully local

Retrieval recall through dig's real scan → store → index pipeline, on CPU with small open embedding models. No reranker, no LLM in the loop.

98.0% hit@5 on LongMemEval-S (full 500 questions) vs. MemPalace's published 96.6% on the same benchmark and model class

LongMemEval-S

all-MiniLM-L6-v2 (Q8)

Full official set — 500 questions, 19,829 sessions

Metric	FTS	Vector	Hybrid
recall@5	60.9%	93.3%	93.9%
recall@10	62.5%	97.0%	97.1%
hit@5	66.8%	97.8%	98.0%
hit@10	67.2%	98.8%	98.8%
ndcg@10	61.1%	90.4%	92.1%
mrr	64.1%	90.5%	92.8%

Hybrid hit@5 of 98.0% clears MemPalace's published 96.6% recall_any@5 on the same model class, fully local. Chat sessions paraphrase heavily, which is exactly the gap the vector index closes over FTS (hybrid hit@5 is +31.2pts over FTS).

LoCoMo

nomic-embed-text-v1.5 (Q8)

1,536 questions, multi-session conversations

Metric	FTS	Vector	Hybrid
recall@5	80.4%	78.1%	85.3%
recall@10	86.0%	88.2%	93.2%
hit@5	87.1%	83.7%	91.3%
hit@10	91.3%	92.3%	97.1%
ndcg@10	75.6%	71.6%	78.9%
mrr	75.4%	68.9%	76.8%

Hybrid beats the deterministic FTS baseline on every metric from rank 3 up (+4.9pts recall@5, +5.8pts hit@10). FTS stays the zero-dependency default, available per query.

BEAM (128K tier)

all-MiniLM-L6-v2 (Q8)

355 questions, 5,732 turns — turn-level evidence

Metric	FTS	Vector	Hybrid
recall@5	31.3%	36.0%	34.8%
recall@10	39.4%	46.1%	43.4%
hit@5	49.0%	54.6%	51.8%
hit@10	58.3%	64.8%	62.3%
ndcg@10	29.9%	35.4%	34.4%
mrr	35.0%	41.2%	40.9%

The unsaturated frontier (ICLR 2026): one long conversation, near-duplicate turns, evidence annotated per turn. Vector and hybrid both beat FTS on every metric; here vector edges out hybrid, because weak lexical rankings dilute RRF. Results for larger tiers are still being backfilled.
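The dilution effect can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is not dig's implementation: the function name `rrf_fuse`, the turn ids, and the smoothing constant `k=60` (a commonly used default) are assumptions made for illustration only.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every id it contains.
    k=60 is an assumed default, not a value taken from dig.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical turn ids: the vector ranking puts the evidence turn "t7"
# first, while a weak lexical ranking misses it entirely.
vector = ["t7", "t2", "t9"]
lexical = ["t2", "t9", "t4"]

fused = rrf_fuse([vector, lexical])
print(fused)  # ['t2', 't9', 't7', 't4']
```

In this toy case the evidence turn falls from rank 1 in the vector list to rank 3 after fusion, because two near-duplicate turns appear in both lists and collect score from each. That is one plausible way a weak lexical ranking can pull hybrid below vector-only retrieval.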

Published numbers from other systems

Shown for context. Retrieval recall and LLM-judged QA accuracy are different measurements — recall is structurally higher, since finding the evidence is easier than answering from it. dig reports retrieval metrics because dig is the retrieval layer serving knowledge-base management; answering belongs to the agent driving it.

MemPalace: 96.6% recall_any@5 · LongMemEval

Same metric as dig's hit@5. This is MemPalace's default-embeddings number; with its own structure enabled it drops to 89.4% (rooms) / 84.2% (compressed).

mem0: 94.4% QA accuracy (LLM judge) · LongMemEval

Different metric — end-to-end answer quality, not retrieval recall. Structurally lower than recall; not directly comparable.

mem0: 92.5% QA accuracy (LLM judge) · LoCoMo

Different metric — answer quality at ~6.9k tokens per query.

Method

Each benchmark is mapped onto a real dig knowledge base (one file per session or turn), indexed with the production FTS + vector pipeline, and scored per question against its evidence with standard IR metrics (recall@k, hit@k, NDCG@10, MRR). FTS is the deterministic default; vector and hybrid (Reciprocal Rank Fusion) opt in via a [retrieval] policy. Full methodology and reproduction commands live in docs/evals.md.
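For readers unfamiliar with the metrics named above, the following is a minimal per-question sketch using their standard definitions. It assumes binary relevance (an item either is evidence or is not) and a ranked list without duplicates; the function names and sample ids are hypothetical, and dig's actual scoring code may differ in details.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the evidence items that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)


def hit_at_k(ranked, relevant, k):
    """1.0 if at least one evidence item appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0


def mrr(ranked, relevant):
    """Reciprocal rank of the first evidence item, 0.0 if none is retrieved."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical question with three evidence sessions, two of them retrieved.
ranked = ["s3", "s1", "s8", "s2", "s5"]
relevant = {"s1", "s2", "s9"}

print(recall_at_k(ranked, relevant, 5))  # two of three evidence items found
print(hit_at_k(ranked, relevant, 5))     # 1.0
print(mrr(ranked, relevant))             # 0.5
print(ndcg_at_k(ranked, relevant, 5))
```

Averaging each function over all questions gives the benchmark-level figure. Because hit@k needs only one evidence item in the top k while recall@k needs all of them, hit@k is never lower than recall@k for the same ranking, which matches the pattern in the tables.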