REFIND: A New Method for Detecting Hallucinations in Large Language Models
Large language models (LLMs) are increasingly used in various applications, but their tendency to generate factually incorrect information, a phenomenon known as “hallucination,” poses a significant challenge. A new paper introduces REFIND (Retrieval-augmented Factuality Hallucination Detection), a novel framework designed to pinpoint and classify hallucinated spans within LLM outputs. Unlike previous methods, REFIND directly leverages retrieved documents to assess the context sensitivity of each generated token, providing a more accurate and efficient approach to hallucination detection.
Hallucinations manifest as fabricated facts or claims inserted into an LLM’s response. For example, an LLM asked “Who directed the movie Citizen Kane?” might respond, “Orson Welles directed Citizen Kane in 1945, and it won eight Academy Awards.” While the first part is correct, the number of awards is a hallucination. This misinformation can have serious consequences, especially in knowledge-intensive fields.
Existing approaches to hallucination detection often rely on pre-trained language models or internal knowledge. Token-level classifiers, for instance, label individual words or phrases as factual or hallucinatory, but these can struggle with low-resource languages or miss context. Retrieval-augmented methods, which incorporate external information, can improve accuracy but often involve complex multi-step processes, potentially leading to alignment errors between the original and modified responses.
REFIND offers a different approach. It introduces the Context Sensitivity Ratio (CSR), a metric quantifying how much a token’s probability of generation changes when considering retrieved documents versus the LLM’s internal knowledge alone. A high CSR indicates a strong dependence on external context, suggesting that the token is likely a hallucination.
Imagine the LLM generating the sentence “The capital of France is Paris, and it is famous for its croissants and Eiffel Tower.” REFIND analyzes each word. “Paris” would likely have a low CSR, as it is a well-known fact readily available in the LLM’s knowledge base and supported by retrieved documents. However, a claim about French croissant consumption might yield a higher CSR if such a claim is unsupported by retrieved text, indicating a hallucination.
The researchers evaluated REFIND using the SemEval 2025 Task 3: Mu-SHROOM dataset, a multilingual benchmark containing LLM responses annotated for hallucinated spans across nine languages. REFIND significantly outperformed baseline methods like token-level classifiers and FAVA (a state-of-the-art retrieval-augmented approach) achieving a superior average IoU (Intersection over Union) score – a measure of overlap between predicted and actual hallucinated spans. This demonstrates its robustness across high and low-resource languages.
However, REFIND’s effectiveness relies on the quality of the retriever. Inaccurate retrieval can affect CSR calculations, and the computational cost of evaluating token probabilities with and without contextual evidence might limit applications requiring very low latency. Future work should address these limitations to further enhance the reliability of REFIND. Despite these limitations, REFIND represents a significant advancement in accurately identifying hallucinated information from LLMs, improving their trustworthiness and utility.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.