AI Papers Reader

Personalized digests of latest AI research

View on GitHub

"Chain-of-Defensive-Thought" Improves LLM Robustness Against Misinformation

Large language models (LLMs) have become increasingly adept at answering questions using external information. This is often done via a method known as Retrieval-Augmented Generation (RAG), where LLMs search for and incorporate relevant information from external sources like the internet. However, this reliance on external data opens up a vulnerability: if the referenced information is inaccurate or intentionally misleading, the LLM’s responses can be compromised.

A new paper proposes a simple but effective technique called “chain-of-defensive-thought” to mitigate this risk, significantly improving LLM robustness against corrupted reference data. The method draws inspiration from the way humans handle conflicting information, by prompting the LLM to first identify relevant sources and then cross-validate information across them before arriving at an answer.

The core idea involves providing LLMs with a few examples demonstrating this structured reasoning process. Instead of directly answering a question based on given references, the LLM is guided to:

  1. Identify relevant contexts: Determine which provided sources contain information pertinent to the query.

  2. Assess reliability: Evaluate the consistency and support for information across different sources. This helps identify and discount potentially unreliable or contradictory references.

  3. Answer defensively: Based on the most reliable and relevant sources, answer the query.

For instance, imagine an LLM is asked a question about the height of the Eiffel Tower, and is given ten external articles. Nine articles correctly state the height, but one article has been tampered with, and claims a wildly incorrect height. A standard LLM might be misled by this corrupted reference. In contrast, an LLM using the “chain-of-defensive-thought” method will:

  • Recognize that the corrupted article’s height claim contradicts the other nine articles.
  • Identify the nine consistent articles as the most reliable source of information.
  • Provide an answer based on the consistent data, effectively ignoring the misinformation.

The paper’s authors tested this method across a range of LLMs, including GPT-4o, GPT-3.5-turbo, and several open-source models. They evaluated performance on tasks like answering factual questions from the Natural Questions and RealTime QA datasets, in the presence of reference corruption attacks such as prompt injection and knowledge corruption. The results showed remarkable improvements in robustness.

For example, on the Natural Questions dataset, the accuracy of GPT-4o plummeted from 60% to as low as 3% when even just 1 out of 10 provided references was corrupted with prompt injection attacks. However, with “chain-of-defensive-thought” prompting, GPT-4o maintained an accuracy of 50% under the same attack conditions. The researchers found that this technique consistently boosts the resilience of LLMs when external references have been compromised. The key benefit is that it requires no additional training, instead relying on clever prompt engineering to elicit robust reasoning.