The result that stood out

Across three models in our gating experiment, the filter had no measurable effect on false-allow rate. Expected, to a degree.

What was not expected: for o3-mini, the gate was completely transparent in both directions. False-allow A = 18.8%. False-allow B = 18.8%. False-fallback A = 14.7%. False-fallback B = 14.7%.

Exactly the same. The gate processed every answer and changed nothing: it caught none of the answers it should have caught, and wrongly blocked none that it should have let through.

For comparison, gpt-4o-mini had zero false-allows with or without the gate. The gate’s small cost there was a slight increase in false-fallbacks: paraphrases that were technically correct but got blocked for insufficient word overlap.

With o3-mini, the gate simply did not register.


How the gate works

The gate checks two signals per answer:

  1. Citation validity: cited document IDs must exist in the retrieved set.
  2. Lexical grounding: answer text must share sufficient word overlap with retrieved chunks, above a threshold.

Both signals are designed to catch answers that drift from the source material. The assumption is that a hallucinated answer will either cite documents that were not retrieved, or use language that does not appear in the retrieved context.
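A minimal sketch of such a gate, assuming a simple whole-answer word-overlap check (the function names, the Jaccard-style overlap measure, and the 0.5 threshold are illustrative, not details from the production system):

```python
import re

def citation_valid(cited_ids, retrieved_ids):
    """Signal 1: every cited document ID must exist in the retrieved set."""
    return set(cited_ids) <= set(retrieved_ids)

def lexical_grounding(answer, chunks, threshold=0.5):
    """Signal 2: the fraction of answer words that also appear somewhere
    in the retrieved chunks must meet the threshold."""
    answer_words = set(re.findall(r"\w+", answer.lower()))
    chunk_words = set()
    for chunk in chunks:
        chunk_words |= set(re.findall(r"\w+", chunk.lower()))
    if not answer_words:
        return False
    return len(answer_words & chunk_words) / len(answer_words) >= threshold

def gate(answer, cited_ids, retrieved):
    """Allow the answer only if both signals pass; otherwise fall back."""
    ok = (citation_valid(cited_ids, retrieved.keys())
          and lexical_grounding(answer, retrieved.values()))
    return "allow" if ok else "fallback"
```

Note that neither signal ever reads the answer's claims against the chunks' claims; both only compare surface features.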

A reasonable assumption. It breaks for reasoning models.


What reasoning models do during generation

A reasoning model does not jump from prompt to answer. It works through an extended chain before producing output: decomposing the question, evaluating candidate answers, checking internal consistency.

This process naturally pulls vocabulary and structure from the context window. By the time the model produces its final answer, that answer has been shaped by repeated exposure to the retrieved documents, even if the factual claims diverge from them.

The result is an answer that reads like it came from the corpus because, in a terminological sense, it did. The model absorbed the domain language, the entity names, the phrasing patterns; it just did not always get the facts right.


Why both signals fail

Citation validity: o3-mini reliably cites document IDs from the retrieved set. The reasoning process includes cross-checking references, so by the final answer, the cited IDs are real. The check passes; the content may still be wrong.

Lexical grounding: the answer shares high word overlap with retrieved content because the model’s reasoning process has already internalized that vocabulary. A hallucinated claim about, say, feature availability will use the exact right product names and terminology mined from the retrieved chunks. The overlap threshold is met.

Both signals confirm that the answer looks like the corpus. Neither checks whether it agrees with the corpus. For most models, that gap is small enough to be useful. For reasoning models, the gap is the whole problem.
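A toy illustration of the failure mode (the product and chunk text are invented for the example): a claim that inverts the source fact still scores perfect word overlap, because every content word in the answer was mined from the chunk.

```python
import re

def overlap(answer, chunk):
    """Fraction of answer words that appear in the chunk."""
    answer_words = set(re.findall(r"\w+", answer.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(answer_words & chunk_words) / len(answer_words)

chunk = "Acme Analytics: real-time dashboards are not available on the Starter plan."
hallucinated = "Real-time dashboards are available on the Starter plan."  # fact inverted

print(overlap(hallucinated, chunk))  # -> 1.0: every answer word comes from the chunk
```

The single word doing the factual work, "not", is the one the overlap measure is structurally unable to weigh.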


What would actually work

The signal the gate needs is not lexical overlap but semantic entailment: does the retrieved content support the specific claims in the answer?

A cross-encoder trained on NLI (natural language inference) can check this directly. Given an answer sentence and a retrieved chunk, it outputs a probability that the chunk entails the sentence. If no retrieved chunk entails a given claim above threshold, the answer fails.
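A sketch of what that gate could look like, with the NLI model behind a pluggable `entail_prob(premise, hypothesis) -> float` callable (in practice this could wrap a cross-encoder such as `cross-encoder/nli-deberta-v3-base` from the sentence-transformers library; the wrapper shape and the 0.8 threshold are assumptions, not details from the source):

```python
from typing import Callable, Iterable

def entailment_gate(
    answer_sentences: Iterable[str],
    chunks: Iterable[str],
    entail_prob: Callable[[str, str], float],
    threshold: float = 0.8,
) -> bool:
    """Pass only if every answer sentence is entailed by at least one chunk.

    entail_prob(premise, hypothesis) should return P(premise entails
    hypothesis), e.g. from an NLI cross-encoder. Keeping it as a callable
    leaves the gate model-agnostic.
    """
    chunk_list = list(chunks)
    for sentence in answer_sentences:
        if not any(entail_prob(c, sentence) >= threshold for c in chunk_list):
            return False  # one unsupported claim fails the whole answer
    return True
```

Unlike word overlap, this check is sentence-level: a single inverted claim fails even when every other sentence is well supported.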

This is more computationally expensive than word overlap and requires a second model in the pipeline. Whether that cost is acceptable depends on the application.

The citation check is still worth keeping as a first-pass filter. It is cheap and catches models that cite entirely fabricated sources. For reasoning models specifically, it is just not sufficient on its own.