The setup

I needed a corpus for an LLM retrieval experiment. The questions: given a user query, can the system retrieve the right documents, and can an LLM answer accurately from them?

I already had structured content in Sanity: B2B SaaS documentation, the kind with clear document boundaries and well-defined fields. The question was whether I could use Sanity itself as the retrieval backend, or whether I would need to pipe everything into a dedicated vector store.

I decided to test it before assuming the answer.


What Sanity’s embeddings index actually does

Sanity has a first-party embeddings index feature. You configure it against a dataset, define which document types and fields to index, and it handles the rest: chunking, embedding, and exposing a semantic search endpoint.

No separate vector DB. No sync job to maintain. The index stays current automatically.

For a baseline, plain GROQ already gives you lexical relevance scoring:

const results = await client.fetch(
  `*[_type == "post"] | score(pt::text(body) match $query) | order(_score desc) [0...5]`,
  { query: "pricing plans for enterprise customers" }
)

Semantic retrieval against the embeddings index goes through a dedicated endpoint:

const results = await client.request({
  url: `/vX/embeddings-index/search/${indexName}`,
  method: "POST",
  body: { query: "pricing plans for enterprise customers", maxResults: 5 },
})

The semantic endpoint returns document IDs and similarity scores. You can then fetch the full documents with a follow-up GROQ query, filtering on _id.
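One wrinkle worth handling: a GROQ `_id in $ids` filter returns documents in arbitrary order, so the semantic ranking has to be reapplied after the fetch. A minimal sketch, assuming each semantic hit has the shape { documentId, score } (check the actual response shape of your API version):

```javascript
// Re-rank GROQ-fetched documents by their semantic similarity scores,
// since a GROQ `_id in $ids` filter does not preserve ranking order.
function rankBySemanticScore(hits, docs) {
  // Map each hit's document ID to its similarity score.
  const scoreById = new Map(hits.map((h) => [h.documentId, h.score]));
  return docs
    .filter((d) => scoreById.has(d._id)) // drop anything not in the hit list
    .sort((a, b) => scoreById.get(b._id) - scoreById.get(a._id));
}
```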


The hybrid pattern

The part that worked better than expected: combining semantic search with GROQ filters.

Semantic search finds the right meaning. GROQ filters to the right scope: specific document types, date ranges, publication status.

In practice, the pipeline was:

  1. Run semantic search to get candidate _id values.
  2. Run a GROQ query: *[_id in $ids && _type == "article" && !(_id in path("drafts.**"))]
  3. Pass the filtered documents to the LLM as grounding context.

This keeps the retrieval fast and the context clean. Semantic search handles ambiguity; GROQ handles policy.
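The three steps above can be sketched as one function. This is a sketch, not the exact implementation: the endpoint path and hit shape ({ documentId, score }) mirror the earlier snippet and may differ in your API version, and `client` is any object exposing the Sanity client's request() and fetch() methods.

```javascript
// Hybrid retrieval: semantic candidates, then GROQ scoping.
async function retrieveContext(client, indexName, query, maxResults = 5) {
  // 1. Semantic search → candidate document IDs.
  const hits = await client.request({
    url: `/vX/embeddings-index/search/${indexName}`,
    method: "POST",
    body: { query, maxResults },
  });
  const ids = hits.map((h) => h.documentId);

  // 2. GROQ handles policy: right type, published documents only.
  const docs = await client.fetch(
    `*[_id in $ids && _type == "article" && !(_id in path("drafts.**"))]`,
    { ids }
  );

  // 3. The filtered documents become the LLM's grounding context.
  return docs;
}
```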


The thing I had not considered

Every Sanity document carries _id and _rev.

_rev is the revision identifier. It pins down exactly which version of a document the model saw at retrieval time. If the document was edited between retrieval and the user reading the answer, the mismatch is detectable.

For an audit layer, this is genuinely useful. You can record which document IDs and revisions were in the context at inference time, and trace any answer back to a specific snapshot of the corpus.
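The audit layer reduces to two small helpers: record the (_id, _rev) pairs at inference time, then later diff them against the current documents. A minimal sketch; the helper names are mine, not Sanity's:

```javascript
// Capture which document revisions were in the LLM's context.
function snapshotContext(docs) {
  return docs.map((d) => ({ _id: d._id, _rev: d._rev }));
}

// Given a snapshot and freshly re-fetched documents, return the entries
// whose revision has changed since inference time.
function findStaleDocs(snapshot, currentDocs) {
  const revById = new Map(currentDocs.map((d) => [d._id, d._rev]));
  return snapshot.filter(
    (s) => revById.has(s._id) && revById.get(s._id) !== s._rev
  );
}
```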

Most vector stores do not give you this for free. You have to bolt it on.


What did not work as expected

The semantic search is accurate enough for well-structured content. For short or ambiguous queries it sometimes retrieved loosely related documents: matches where you could see why the embedding fired, but the content was not actually useful.

Lexical grounding (word overlap between the answer and the retrieved content) also turned out to be a weak signal for hallucination detection. But that is a separate problem, and a separate post.
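For concreteness, the naive version of that signal is just the fraction of answer words that appear in the retrieved text. This illustrative function is mine, not from any library, and its weakness is visible in the shape of it: a faithful paraphrase scores low, while an on-topic fabrication reusing the source's vocabulary scores high.

```javascript
// Naive lexical grounding: fraction of distinct answer words that also
// appear in the retrieved content. Illustrative only — a weak signal.
function lexicalOverlap(answer, retrieved) {
  const tokenize = (s) => new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const answerWords = [...tokenize(answer)];
  if (answerWords.length === 0) return 0;
  const sourceWords = tokenize(retrieved);
  const hits = answerWords.filter((w) => sourceWords.has(w)).length;
  return hits / answerWords.length;
}
```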


Verdict

If you already have content in Sanity and need a retrieval layer for an LLM feature, the embeddings index is worth testing before reaching for a dedicated vector store. The hybrid GROQ + semantic pattern is practical. The _id/_rev provenance is the most underrated part of the whole stack.

A CMS with a working semantic search API, structured provenance, and a composable query layer is more than I expected going in.