Claude vs. GPT: Which is Better at Admitting "I Don't Know"?

Posted on 2026-05-18 07:42:44

If I had a dollar for every time a stakeholder asked me for the "hallucination rate" of a model, I wouldn’t need to worry about RAG (Retrieval-Augmented Generation) infrastructure anymore. In nine years of deploying knowledge systems in regulated industries—banking, pharma, and defense—I’ve learned one immutable truth: the question itself is broken. When people ask, "Is Claude or GPT better at admitting when it doesn't know the answer?", they are looking for a single percentage to make a buying decision. They’re looking for a silver bullet. They won't find one.

What they will find is a complex game of abstention behavior, where the trade-off between "helpfulness" and "honesty" shifts depending on how the model was RLHF-trained and what specific failure mode you are testing for.

The Myth of the "Hallucination Rate"

Stop asking for a "hallucination rate." It does not exist in a vacuum. A hallucination is a categorical failure, but it manifests in different ways:

Extrinsic Hallucination: The model makes up a fact not supported by your provided context.
Intrinsic Hallucination: The model contradicts the source text it was just given.
Refusal Failure (Over-refusal): The model says "I don't know" when the answer *is* in the provided source (a massive pain in enterprise search).
False Certainty: The model confidently provides an answer to a nonsensical or unanswerable query.

When a vendor claims a "near-zero hallucination rate," they are usually benchmarking a specific, narrow task on a clean dataset. In production, your users aren’t querying a clean dataset; they are querying messy, overlapping, and contradictory document stores. The benchmark doesn't measure the model; it measures the test designer’s optimism.
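To make the point concrete, here is a minimal sketch of reporting a failure profile instead of a single rate. It assumes each response has already been graded by a human or a judge model; the `GradedResponse` fields and the `classify` rules are hypothetical names chosen for illustration, not a standard schema.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    CORRECT_ANSWER = "correct_answer"
    CORRECT_ABSTENTION = "correct_abstention"
    EXTRINSIC = "extrinsic_hallucination"
    INTRINSIC = "intrinsic_hallucination"
    OVER_REFUSAL = "over_refusal"
    FALSE_CERTAINTY = "false_certainty"


@dataclass
class GradedResponse:
    # Hypothetical grading labels, assumed to come from a reviewer or judge.
    answerable: bool                   # is the answer present in the context?
    abstained: bool                    # did the model say "I don't know"?
    contradicts_context: bool = False  # does the answer conflict with the source?
    unsupported_claim: bool = False    # does it add facts absent from the source?


def classify(r: GradedResponse) -> Outcome:
    """Map one graded response onto the failure modes listed above."""
    if r.abstained:
        return Outcome.OVER_REFUSAL if r.answerable else Outcome.CORRECT_ABSTENTION
    if not r.answerable:
        return Outcome.FALSE_CERTAINTY
    if r.contradicts_context:
        return Outcome.INTRINSIC
    if r.unsupported_claim:
        return Outcome.EXTRINSIC
    return Outcome.CORRECT_ANSWER


def failure_profile(responses):
    """Return the share of responses in each outcome bucket."""
    total = len(responses)
    if total == 0:
        return {}
    counts = Counter(classify(r) for r in responses)
    return {o.value: counts[o] / total for o in Outcome}
```

Two systems with the same overall error count can have very different profiles, which is exactly what a single percentage hides.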

Definitions Matter: Abstention vs. Factuality

Before we compare models, let’s get our terminology straight. This is how we audit these systems in the real world:

Faithfulness: Does the output stay strictly within the provided context? (This is a RAG-specific metric).
Factuality: Does the output align with ground truth in the real world? (This is a general training metric).
Abstention Behavior: The model’s ability to recognize a query as unanswerable, either because the information is missing or the question is malformed.

A model that is highly faithful might refuse to answer a question that it *actually knows* the answer to via its pre-training, simply because the provided RAG context is silent on it. That’s a "refusal strategy" choice. Is that a bug or a feature? In a compliance-heavy environment, that’s usually a feature.

Benchmark Analysis: The AA-Omniscience Framework

When looking at benchmarks like AA-Omniscience, you need to understand what is actually being measured. The "AA" stands for Artificial Analysis, the group that publishes it, not "Answer Accuracy." The benchmark is built around the difference between questions a model *can* answer and questions it *cannot*: a wrong answer is penalized, while declining to answer is not, so a model that guesses scores worse than one that admits a gap.
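A scoring rule of this kind, where a wrong answer costs more than an honest abstention, can be sketched as follows. The formula here is an illustrative assumption, not a quotation of the benchmark's published methodology; check the benchmark's own documentation for its actual scoring. The function names are hypothetical.

```python
def plain_accuracy(correct: int, incorrect: int, abstained: int) -> float:
    """Share of all questions answered correctly, as a percentage."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no graded questions")
    return 100.0 * correct / total


def abstention_aware_score(correct: int, incorrect: int, abstained: int) -> float:
    """Illustrative score: +1 per correct answer, -1 per wrong answer,
    0 per abstention, scaled to the range [-100, 100]."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no graded questions")
    return 100.0 * (correct - incorrect) / total


# Hypothetical counts, made up for illustration only.
eager = {"correct": 60, "incorrect": 40, "abstained": 0}
cautious = {"correct": 50, "incorrect": 10, "abstained": 40}

eager_accuracy = plain_accuracy(**eager)
cautious_accuracy = plain_accuracy(**cautious)
eager_score = abstention_aware_score(**eager)
cautious_score = abstention_aware_score(**cautious)
```

With these made-up counts, the "eager" system wins on plain accuracy but loses once wrong answers are penalized, which is the trade-off the rest of this post is about.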

| Metric | GPT-4o (OpenAI) | Claude 3.5 Sonnet (Anthropic) |
| --- | --- | --- |
| Default Refusal Tendency | Lower (Biased towards "Helpfulness") | Higher (Biased towards "Safety/Honesty") |
| Context Adherence | High, but prone to "hallucinating" if forced | Very High; stricter interpretation |
| "I Don't Know" Accuracy | Often needs prompt engineering | Natively better at flagging gaps |

So What? If you are building a tool for financial advisors, the table above suggests that Claude’s default state is more risk-averse. If you use GPT-4o, you must invest more heavily in system-level instructions to force the model to reject queries that fall outside your knowledge base. If you use Claude, you might find yourself fighting "lazy" refusals where the model is *too* afraid to answer.

The Reasoning Tax on Grounded Summarization

There is a hidden cost to demanding that a model be honest: the reasoning tax. When you instruct a model to "only answer if you are 100% sure based on the provided context," you are asking it to check the question against the context before it commits to an answer, and that extra verification work shows up as longer, slower, more hedged responses.

In RAG workflows, this often looks like:

1. Context Retrieval: Pulling documents.
2. Evaluation: Can the question be answered?
3. Generation: Drafting the response.

If you don't use a separate "check" step (an agentic workflow), you are relying on the model’s latent ability to perform self-correction. Claude 3.5 Sonnet tends to pay this "reasoning tax" more explicitly in its output structure: it is slower to "say" things because it is more cautious. GPT-4o is optimized for latency and "helpfulness," meaning it will try to answer even if the source material is shaky, unless you explicitly tighten the prompt.
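The three-step flow with an explicit "check" step might look roughly like this. It is a sketch under stated assumptions: `retrieve`, `is_answerable`, and `generate` are hypothetical callables you would back with your own retriever and model calls, and the `toy_*` functions are stand-ins so the example runs without any external service.

```python
from typing import Callable, List

REFUSAL = "I cannot answer this question based on the provided documents."


def answer_with_check(
    question: str,
    retrieve: Callable[[str], List[str]],
    is_answerable: Callable[[str, List[str]], bool],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Retrieve, then evaluate answerability, and only then generate."""
    docs = retrieve(question)
    if not docs:
        return REFUSAL  # nothing retrieved, so there is nothing to ground on
    if not is_answerable(question, docs):
        return REFUSAL  # the separate check step said no
    return generate(question, docs)


# Toy stand-ins, hypothetical and deliberately simplistic.
CORPUS = ["The daily wire transfer limit is 10,000 USD."]


def toy_retrieve(question: str) -> List[str]:
    return list(CORPUS)


def toy_is_answerable(question: str, docs: List[str]) -> bool:
    # In practice this would be a second model call or a classifier.
    return "wire transfer" in question.lower()


def toy_generate(question: str, docs: List[str]) -> str:
    return docs[0]
```

Keeping the evaluation step outside the generation call means the refusal decision is made in your code, where you can log and test it, rather than inside the model.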

Refusal Strategy: How to Engineer "I Don't Know"

Do not rely on the model’s "natural" ability to say "I don't know." You must build the refusal strategy into your system architecture. Here is how I approach this in production environments:

1. The "Negative Constraint" Prompting

Never just ask the model to be honest. Give it a specific, rigid trigger. Example: "If the provided context does not contain the answer, you must respond exactly with: 'I cannot answer this question based on the provided documents.'"
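A rigid trigger is only useful if your code can detect it. The sketch below builds the prompt around the exact sentence from the example above and checks the output against it; `build_prompt` and `is_refusal` are hypothetical helper names, and the prompt wording beyond the quoted instruction is an assumption.

```python
REFUSAL_SENTINEL = "I cannot answer this question based on the provided documents."


def build_prompt(context: str, question: str) -> str:
    """Embed the negative constraint with an exact, machine-checkable phrase."""
    return (
        "Answer the question using only the context below.\n"
        "If the provided context does not contain the answer, "
        f"you must respond exactly with: '{REFUSAL_SENTINEL}'\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def is_refusal(model_output: str) -> bool:
    """True only when the output is the sentinel, ignoring outer whitespace
    and any quotation marks the model may have echoed around it."""
    cleaned = model_output.strip().strip("'\"").strip()
    return cleaned == REFUSAL_SENTINEL
```

An exact-match check is strict on purpose: a response that starts with the refusal and then keeps talking is not treated as an abstention, so it can be flagged for review instead of silently passing.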

2. The Multi-Pass Audit Trail

Stop treating citations as proof. Treat them as an audit trail. If the model is citing a document, don't just ask it to summarize; ask it to provide the exact sentence and page. If it can't produce a quote that actually appears in the source, your pipeline has a programmatic reason to trigger an abstention.
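One way to turn citations into an audit trail is to verify every quoted sentence against the source text in a second pass. This is a minimal sketch: the citation shape (`doc_id` and `quote` keys) is an assumed format that you would have to ask the model to produce, and verbatim matching after whitespace and case normalization is a deliberately strict choice.

```python
import re

REFUSAL = "I cannot answer this question based on the provided documents."


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wrapping does not break matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(citations, documents) -> bool:
    """citations: list of {"doc_id": str, "quote": str} (assumed format).
    documents: dict mapping doc_id to the full source text.
    Returns True only if every quote appears verbatim in its cited document."""
    if not citations:
        return False  # an uncited answer fails the audit
    for citation in citations:
        source = documents.get(citation.get("doc_id"))
        quote = normalize(citation.get("quote", ""))
        if source is None or not quote or quote not in normalize(source):
            return False
    return True


def audited_answer(answer: str, citations, documents) -> str:
    """Second pass: release the answer only if its audit trail checks out."""
    return answer if verify_citations(citations, documents) else REFUSAL
```

Verbatim matching will reject paraphrased quotes, which produces some over-refusal; whether that is acceptable depends on which failure mode costs you more.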

3. Temperature and Top-P

In the context of "I don't know," a high temperature is your enemy. Keep temperature near 0.0 or 0.1 for RAG tasks. Creativity is the enemy of honesty. If the model is allowed to be "creative," it will eventually synthesize a hallucination that sounds exactly like a fact.
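To see why low temperature matters, here is a small worked example of temperature-scaled softmax over next-token scores. The logits are made-up numbers for illustration, and the function is a textbook sketch rather than any vendor's sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into sampling probabilities at a given temperature.
    Temperature must be positive here; many APIs treat 0 as greedy decoding
    rather than dividing by zero."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: one well-supported token and two weaker alternatives.
logits = [4.0, 2.0, 1.0]

low_temp = softmax_with_temperature(logits, 0.1)   # nearly all mass on the top token
default = softmax_with_temperature(logits, 1.0)
high_temp = softmax_with_temperature(logits, 2.0)  # weaker tokens gain real probability
```

At low temperature the best-supported token is chosen almost every time; as temperature rises, the weaker alternatives get sampled more often, and in a RAG answer those alternatives are where the plausible-sounding fabrications come from.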

Conclusion: The "Best" Model is the One You Can Control

If your priority is preventing legal liability, Claude 3.5 Sonnet currently feels more "stubborn" in a way that aligns with enterprise safety requirements. It is more likely to refuse a query that lacks sufficient grounding. GPT-4o is a powerhouse of reasoning, but it is "eager." It wants to please the user, which makes it inherently more dangerous in a RAG system where the context might be incomplete.

Neither model is "better" at admitting "I don't know" in a vacuum. They both possess the capability. The difference lies in their training biases: OpenAI prioritizes task completion, while Anthropic’s current iteration prioritizes the safety boundary. For your team, the choice should be driven by which failure mode you can tolerate more: an incomplete, frustrating answer (refusal) or a confident, plausible-sounding lie (hallucination).

Choose your failure mode carefully. That is the only real metric that matters.