Highlighted Chain of Thought Prompting Improves Large Language Model Accuracy and User Verification

🔊

💬 Ask

Large language models (LLMs) are powerful tools, but they sometimes “hallucinate”—producing inaccurate or nonsensical information. A new paper, “HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs,” introduces a prompting technique called Highlighted Chain of Thought (HoT) designed to mitigate this problem. The research, appearing on arXiv, shows HoT consistently improves LLM accuracy and speeds up human verification of LLM responses.

The core of HoT lies in its approach to prompting. Instead of simply asking a question, HoT prompts the LLM to first re-format the question, wrapping key facts in XML tags (e.g., <fact1>, <fact2>). Then, the LLM is prompted to generate a step-by-step response (a “chain of thought”), highlighting the facts used in the answer using the same XML tags. This allows for easy visual tracing of where the LLM derived its information.

For example, consider the question: “Jenna starts with 8 sapphires. She trades 3 for 2 rubies. Sapphires are worth $800, rubies $1200. How much are her jewels worth?” A standard chain of thought might produce a correct answer, but without highlighting the factual basis. HoT, however, would re-format the question as: “Jenna starts with <fact1>8 sapphires</fact1>. She trades 3 sapphires for two rubies. Sapphires are worth $800, rubies $1200`. How much are her jewels worth?”

The LLM’s answer would then incorporate similar highlighting. For instance, “Jenna starts with <fact1>8 sapphires</fact1>…leaving her with <fact1>5 sapphires</fact1>. These <fact1>5 sapphires</fact1> are worth…Her <fact3>rubies</fact3> are worth <fact5>$2400</fact5>. In total, her jewels are worth…” This clear linkage between question and answer greatly enhances transparency and makes verification easier.

The researchers evaluated HoT across five LLMs (Gemini-1.5-Flash, Gemini-1.5-Pro, Llama-3.1-70B, Llama-3.1-405B, and GPT-4) and seventeen diverse tasks, ranging from arithmetic and logical reasoning to reading comprehension. In almost every case, HoT significantly outperformed standard chain-of-thought prompting, yielding average accuracy gains of 1.6%, 2.58%, and 2.53% in arithmetic, question answering, and logical reasoning tasks, respectively.

Interestingly, the study also involved human participants tasked with verifying LLM responses. The presence of highlighted facts in HoT answers noticeably improved their speed, completing the task about 25% faster. However, there was a surprising side effect: while highlights helped users identify correct answers, they also led them to overestimate the accuracy of incorrect answers. This finding highlights a potential pitfall of using highlights to boost human confidence in AI outputs.

The researchers also conducted ablation studies, examining the impact of different aspects of the HoT prompt (e.g., repeating the question, highlighting only in the question or only in the answer). Their findings show that highlighting both the question and the answer is crucial to maximizing accuracy.

The HoT prompting technique offers a promising way to improve LLM accuracy and make their reasoning process more transparent. However, the study also reminds us to be cautious about over-reliance on highlighting as a measure of answer validity, especially for users. Future research should explore ways to mitigate this potential bias and further optimize the design of effective highlighting strategies for enhancing human-AI collaboration.

AI Papers Reader

Personalized digests of latest AI research

Highlighted Chain of Thought Prompting Improves Large Language Model Accuracy and User Verification

Chat about this paper