The update this post needed

I originally approached Sanity retrieval as a convenience win: fewer moving parts, no external vector store, and a clean GROQ + semantic workflow.

That part is still true. But after running a labeled benchmark with negative controls, the more important truth is this:

Semantic retrieval can be very strong on in-domain relevance and still fail badly on robustness if you ship it without a confidence gate.

This post reflects what held up under testing.


What still works

Sanity remains an excellent retrieval foundation for CMS-native RAG:

  • structured content model
  • composable filtering in GROQ
  • first-party semantic retrieval options
  • provenance fields (_id, _rev) for auditability

For many teams, it is still architecturally cleaner than maintaining a sync pipeline to an external vector store.
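As an illustration of the filtering and provenance points above, a GROQ retrieval query might look like the following sketch. The `post` document type, `published` flag, and `tags` field are assumptions about your schema, not a fixed Sanity convention; the `_id`/`_rev` projection is what makes results auditable.

```typescript
// Hypothetical schema: a "post" document type with a "published" flag
// and a "tags" array. The projection keeps _id and _rev so every
// retrieved chunk can be traced back to an exact document revision.
const retrievalQuery = `
  *[_type == "post" && published == true && $tag in tags]{
    _id,
    _rev,
    title,
    body
  }
`;

// Parameters are passed separately, keeping the query itself static.
const params = { tag: "rag" };
```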


What changed after rigorous evaluation

In our reproducible benchmark, semantic retrieval outperformed lexical retrieval on relevance metrics (Precision@k, Recall@k, MRR, nDCG) for in-domain queries.
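For reference, the four metrics named above can be computed from a ranked list of result ids and a ground-truth relevant set. This is a minimal binary-relevance sketch, not the benchmark's own implementation:

```typescript
// ranked: document ids in retrieval order; relevant: ground-truth ids.
function precisionAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length / k;
}

function recallAtK(ranked: string[], relevant: Set<string>, k: number): number {
  return ranked.slice(0, k).filter((id) => relevant.has(id)).length / relevant.size;
}

// Reciprocal rank of the first relevant hit (0 if none retrieved);
// MRR is the mean of this over all queries.
function reciprocalRank(ranked: string[], relevant: Set<string>): number {
  const i = ranked.findIndex((id) => relevant.has(id));
  return i === -1 ? 0 : 1 / (i + 1);
}

// Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG.
function ndcgAtK(ranked: string[], relevant: Set<string>, k: number): number {
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(k, relevant.size); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```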

At the same time, semantic retrieval produced non-empty results for out-of-domain negative controls at a rate that is too high for production trust if left ungated.

Lexical retrieval showed the opposite profile:

  • lower in-domain recall
  • better selectivity
  • far fewer out-of-domain false positives

This is not a contradiction. It is a tradeoff. Semantic retrieval gives better coverage; lexical retrieval is more conservative.


Practical retrieval architecture now

The production pattern we recommend now is:

  1. semantic retrieval as primary candidate generator
  2. confidence gate on semantic acceptance
  3. lexical fallback for resilience
  4. GROQ policy filters and provenance logging

Without step 2, you get strong demos and fragile production behavior.
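The four steps can be sketched as one gated retrieval function. This is a sketch, not a specific Sanity API: `semanticSearch`, `lexicalSearch`, and the threshold value are assumptions you would replace with your own retrievers and a threshold tuned on your labeled benchmark.

```typescript
interface Hit { id: string; rev: string; score: number; }

const SEMANTIC_THRESHOLD = 0.75; // assumption: tune against your own benchmark

function retrieve(
  query: string,
  semanticSearch: (q: string) => Hit[], // step 1: primary candidate generator
  lexicalSearch: (q: string) => Hit[],  // step 3: conservative fallback
  log: (entry: object) => void,         // step 4: provenance logging
): { hits: Hit[]; source: "semantic" | "lexical" | "none" } {
  // Step 2: accept semantic candidates only above a confidence threshold.
  const accepted = semanticSearch(query).filter((h) => h.score >= SEMANTIC_THRESHOLD);
  if (accepted.length > 0) {
    log({ query, source: "semantic", ids: accepted.map((h) => `${h.id}@${h.rev}`) });
    return { hits: accepted, source: "semantic" };
  }

  // Step 3: fall back to lexical retrieval rather than shipping weak context.
  const lexical = lexicalSearch(query);
  if (lexical.length > 0) {
    log({ query, source: "lexical", ids: lexical.map((h) => `${h.id}@${h.rev}`) });
    return { hits: lexical, source: "lexical" };
  }

  // Refusal path: no confident context; the answer layer should decline.
  log({ query, source: "none", ids: [] });
  return { hits: [], source: "none" };
}
```

The point of the structure is that low-scoring semantic hits are dropped before they ever reach the answer layer, and every accepted result is logged with its `_id`-style identity and revision.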


Why confidence gating matters

A semantic hit is not the same thing as a grounded answer path.

For short, broad, or out-of-domain prompts, semantic systems often return plausible but weakly relevant context. If your answer layer always consumes that context, you raise the risk of confident, wrong answers.

Confidence gating introduces a refusal path when retrieval confidence is low. That is the missing reliability control.


What to implement first

If you are currently using Sanity for semantic retrieval, prioritize:

  • a score threshold or score+agreement policy for semantic acceptance
  • explicit negative-control prompts in your eval seed
  • periodic benchmark reruns as content changes

Good retrieval quality is not static; it drifts with corpus growth.
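One way to implement the score+agreement variant of the acceptance policy (a sketch; both threshold values and the choice of lexical overlap as the agreement signal are assumptions): accept a semantic hit outright if its score clears a strict bar, or at a lower bar if the lexical retriever independently returned the same document.

```typescript
interface Scored { id: string; score: number; }

// Accept semantic hits that clear the strict score bar, or a lower bar
// when the lexical retriever agrees on the same document id.
function acceptSemantic(
  semantic: Scored[],
  lexicalIds: Set<string>,
  strictThreshold = 0.8, // assumption: accept on score alone
  agreeThreshold = 0.6,  // assumption: accept with lexical agreement
): Scored[] {
  return semantic.filter(
    (h) =>
      h.score >= strictThreshold ||
      (h.score >= agreeThreshold && lexicalIds.has(h.id)),
  );
}
```

Agreement policies like this trade a little in-domain recall for much better out-of-domain selectivity, which is exactly the tradeoff the benchmark surfaced.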


Evidence summary

In our updated labeled benchmark run:

  • semantic retrieval led on in-domain relevance metrics
  • lexical retrieval was more selective and produced fewer out-of-domain acceptances
  • ungated semantic acceptance was the main robustness risk
  • confidence gating with lexical fallback gave the best quality/robustness operating point in this corpus

If you are serious about retrieval quality, benchmarked regression checks should be part of your regular release process.