New Benchmark Reveals Frontier LLMs Hallucinate in 30% of High-Stakes Multi-Turn Conversations
Large Language Models (LLMs) are still plagued by hallucinations—plausible but factually ungrounded claims—a critical vulnerability that worsens in complex, multi-turn dialogue. New research from EPFL and partner institutions has introduced HALLUHARD, the first hard, multi-turn hallucination benchmark designed to expose these failures in high-stakes contexts.
The benchmark’s findings are stark: even the strongest frontier LLMs, like Claude-Opus-4.5 configured with web search, still hallucinate in roughly 30% of challenging conversations. Without web search enabled, hallucination rates soared above 60% for many models tested.
HALLUHARD simulates real-world interactions across four high-stakes domains: legal cases, medical guidelines, technical coding, and niche research questions. Unlike traditional fact-checking benchmarks that rely on simple single-turn Q&A, HALLUHARD requires models to engage in multi-turn dialogue, where early errors can propagate, and, crucially, demands that all factual assertions be supported with verifiable, inline citations.
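The exact item format is not reproduced in this summary; purely as an illustration (the field names and example content below are hypothetical, not taken from the HALLUHARD release), a multi-turn item with citation-bearing answers might be represented like this:

```python
# Hypothetical sketch of a multi-turn benchmark item; structure and field
# names are illustrative only.
conversation = {
    "domain": "legal",  # e.g., legal, medical, coding, or niche research
    "turns": [
        {
            "role": "user",
            "content": "Which court decided Smith v. Jones, and what was the holding?",
        },
        {
            "role": "assistant",
            # Every factual assertion must carry a verifiable inline citation.
            "content": "The case was decided by the Ninth Circuit [1] ...",
            "citations": [{"id": 1, "url": "https://example.org/opinion.pdf"}],
        },
        # Later turns build on earlier answers, which is how early errors propagate.
    ],
}
```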
To accurately check these claims, the researchers developed an advanced, automated judge pipeline that goes beyond surface-level snippets. This system retrieves and parses the full text of cited documents, including PDFs, to confirm two critical points: first, that the reference itself is real (reference grounding), and second, that the specific claim made by the LLM is actually supported by the content within that source (content grounding).
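The authors' judge code is not included in this summary; the sketch below is a minimal Python approximation of the two-stage check as described. The function names, the token-overlap heuristic, and the PDF handling are placeholder assumptions, the real pipeline would use a much stronger verifier (such as an LLM judge) for the content-grounding step.

```python
import io

import requests
from pypdf import PdfReader


def fetch_document_text(url: str, timeout: int = 20) -> str | None:
    """Retrieve a cited source and return its plain text, or None if unreachable."""
    try:
        resp = requests.get(url, timeout=timeout)
        resp.raise_for_status()
    except requests.RequestException:
        return None
    if url.lower().endswith(".pdf") or "pdf" in resp.headers.get("Content-Type", ""):
        # Parse PDF pages into a single text blob.
        reader = PdfReader(io.BytesIO(resp.content))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    return resp.text


def judge_citation(claim: str, url: str) -> dict:
    """Two-stage check: (1) reference grounding, (2) content grounding."""
    text = fetch_document_text(url)
    reference_grounded = text is not None  # does the cited source actually resolve?
    if not reference_grounded:
        return {"reference_grounded": False, "content_grounded": False}
    # Placeholder content check: token overlap between the claim and the source text.
    claim_tokens = {t.lower() for t in claim.split() if len(t) > 3}
    source_tokens = {t.lower() for t in text.split()}
    overlap = len(claim_tokens & source_tokens) / max(len(claim_tokens), 1)
    return {"reference_grounded": True, "content_grounded": overlap > 0.8}
```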
The results show that the primary failure mode lies in content grounding. Models generally succeed at citing a legitimate source (e.g., a specific legal case or a medical guideline PDF), but they frequently fabricate the details they attribute to that source. For example, an LLM might correctly cite a recent scientific paper but then invent specific results or definitions that do not exist within the paper’s text.
The study also provides crucial insights into the dynamics of hallucination:
- Error Propagation: Hallucination rates consistently increased in later turns of a conversation. As models condition on their own past errors, they repeat incorrect citations or build new false claims on top of previous mistakes.
- The Niche Trap: LLMs struggle significantly more with niche facts—information that is real but obscure (e.g., an artwork shown in a local gallery or a scientific paper with few citations)—than with completely fabricated entities (like a totally invented fictional protein). When faced with truly fabricated items, models are more likely to abstain, but when faced with obscure, niche knowledge, they are incentivized to guess, often resulting in plausible-sounding but wrong answers.
- Reasoning is Limited: While activating a model’s “thinking” or reasoning mode generally reduces hallucination rates, simply increasing the level of reasoning effort does not guarantee further factual gains.
These findings underscore that reliance on LLMs for fact-intensive, high-stakes tasks remains premature without significantly improved fidelity. The data suggests that LLM development must focus heavily on improving uncertainty awareness and integrating robust, full-text web verification to handle the complexities of real-world knowledge.