AI Referees: Language Models Debate Their Way to Truth in Conflicting Contexts

Large language models (LLMs) are powerful tools, but they often stumble when their internal knowledge clashes with information provided in the context. This leads to factual errors and “hallucinations,” where the model confidently asserts something false. Researchers at Brown University, Stanford, the University of New South Wales, and other institutions have developed a novel framework called Self-Reflective Debate for Contextual Reliability (SR-DCR) that helps LLMs navigate these knowledge conflicts more effectively.

SR-DCR works by pitting two AI agents against each other in a structured debate. One agent, the “defender,” has access to the potentially conflicting context and argues for its validity. The other agent, the “critic,” is deliberately kept in the dark, relying solely on the LLM’s pre-existing knowledge to challenge the context. A third agent, the “judge,” observes the debate and determines whether the context is trustworthy. The final answer is chosen based on the judge’s verdict combined with the LLM’s own confidence in its initial, context-free answer.
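To make the structure concrete, here is a minimal sketch of the debate loop in Python. It assumes a generic chat-completion callable `llm(prompt) -> str`; the prompt wording, function names, and number of rounds are illustrative assumptions, not the paper's exact prompts or configuration.

```python
from typing import Callable, List, Tuple

def debate_on_context(llm: Callable[[str], str], question: str,
                      context: str, rounds: int = 2) -> bool:
    """Run a defender-vs-critic debate and return the judge's verdict on
    whether the supplied context is trustworthy."""
    transcript: List[Tuple[str, str]] = []
    for _ in range(rounds):
        # Defender sees the (possibly misleading) context and argues for its validity.
        defense = llm(
            f"Context: {context}\nQuestion: {question}\n"
            f"Debate so far: {transcript}\n"
            "Argue that this context is reliable and should be trusted."
        )
        transcript.append(("defender", defense))

        # Critic is kept blind to the context and pushes back using only
        # the model's parametric (pre-existing) knowledge.
        critique = llm(
            f"Question: {question}\nDebate so far: {transcript}\n"
            "Using only your own knowledge, challenge the defended context "
            "if it conflicts with what you know."
        )
        transcript.append(("critic", critique))

    # Judge reads the full transcript and rules on the context's trustworthiness.
    verdict = llm(
        f"Question: {question}\nDebate transcript: {transcript}\n"
        "Based on the debate, is the context trustworthy? Answer 'yes' or 'no'."
    )
    return verdict.strip().lower().startswith("yes")
```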

Imagine the question: “In which year was American football player Lee Kizzire born?” The LLM might “remember” that he was born in 1915. However, if the provided context states “Lee Kizzire (1955-1943) is an American football player,” the model faces a conflict. In SR-DCR, the defender agent would argue that the context is reasonable because it clearly gives the birth year as 1955. The critic agent, lacking access to the context, would counter: “I disagree with this context. Kizzire enlisted in the Army Air Corps during WWII…”. The judge would then weigh the arguments and decide whether the context is reliable. If the judge deems the context unreliable and the LLM is highly confident in its original answer, the final answer reverts to that answer (1915).
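A hedged sketch of that final-answer rule is below. The confidence threshold and the behavior when the judge rejects the context but the model's confidence is low are assumptions; the description above only covers the high-confidence case.

```python
def select_answer(contextual_answer: str, parametric_answer: str,
                  context_trusted: bool, self_confidence: float,
                  threshold: float = 0.8) -> str:
    """Combine the judge's verdict with the model's self-reported confidence
    in its context-free answer. Threshold and low-confidence fallback are
    illustrative assumptions."""
    if context_trusted:
        return contextual_answer   # judge accepted the context, e.g. "1955"
    if self_confidence >= threshold:
        return parametric_answer   # judge rejected it and the model is sure, e.g. "1915"
    return contextual_answer       # assumed fallback when the model is unsure

# Kizzire example: the judge rejects the context and the model is confident
# in its memorized birth year, so the parametric answer wins.
print(select_answer("1955", "1915", context_trusted=False, self_confidence=0.9))  # -> 1915
```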

The researchers tested SR-DCR on a dataset called ClashEval, which is specifically designed to create scenarios where the context contains misleading information. They found that SR-DCR consistently outperformed other methods, including classical debate frameworks and techniques that rely solely on confidence scores.

For instance, traditional “few-shot prompting,” in which the LLM is given a few example question-answer pairs, performs well on standard questions but falters badly when the context is deliberately misleading: in these tests its accuracy dropped from 99.3% to 9.0%. SR-DCR, by contrast, maintained strong accuracy on standard questions (95.7%) while substantially improving accuracy when the context was misleading (29.7%), coming close to a gold-standard oracle that always knows whether the context is trustworthy.

The framework adds little computational overhead, so it can be deployed without significantly increasing inference time. The researchers see SR-DCR as a step toward building more reliable and trustworthy LLMs. The code is available on GitHub: https://github.com/smiles724/Self-Reflective-Debates.