Taming Overconfidence in LLMs: Reward Calibration in RLHF

Large language models (LLMs) are becoming increasingly powerful, but they often overestimate their abilities. This “overconfidence” can lead to inaccurate or unreliable outputs, hindering their usefulness in real-world applications. A new paper by researchers at Carnegie Mellon University and Washington University in St. Louis traces the roots of this overconfidence and proposes two novel solutions to improve LLM calibration.

The paper, titled “Taming Overconfidence in LLMs: Reward Calibration in RLHF,” investigates the role of reinforcement learning from human feedback (RLHF) in shaping LLM confidence. RLHF trains LLMs on human preference feedback so that their responses better match what people want. However, the researchers found that RLHF often leads LLMs to verbalize overly high confidence scores, even when their responses are incorrect.

“We found that the reward models used in RLHF have an inherent bias towards high-confidence scores, regardless of the actual quality of the response,” explains Jixuan Leng, lead author of the paper. “This bias can lead to LLMs expressing overconfidence in their own responses, even when they are incorrect.”
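
To see what such a bias might look like in practice, here is a toy probe (not from the paper) that scores the same wrong answer phrased with two different verbalized confidence levels. `score_response` is a hypothetical stand-in for a trained reward model, and its internal heuristic exists only so the example runs and mimics the reported bias.

```python
# Toy probe of the bias described above. `score_response` is a hypothetical
# stand-in for a trained reward model; its heuristic exists only so the
# example runs and mimics the reported bias toward confident wording.

def score_response(prompt: str, response: str) -> float:
    """Placeholder reward model; a real one scores (prompt, response) pairs learned from preferences."""
    base = 1.0
    # Illustrative bias: assertive confidence statements score higher,
    # regardless of whether the underlying answer is correct.
    if "10/10" in response or "highly confident" in response:
        base += 0.5
    return base

prompt = "What is the capital of Australia?"
answer = "The capital of Australia is Sydney."  # deliberately wrong

for tag in ("I am not sure, maybe 4/10.", "I am highly confident, 10/10."):
    reward = score_response(prompt, f"{answer} {tag}")
    print(f"{tag!r:<35} -> reward {reward:.2f}")
```

A biased reward model assigns the higher score to the confident phrasing even though the underlying answer is identical, and wrong, in both cases.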

To address this issue, the researchers propose two novel methods:

  1. PPO with Calibrated Reward Modeling (PPO-M): This method incorporates explicit confidence scores into the training dataset for the reward model. This forces the reward model to consider both the quality of the response and the confidence level expressed.
  2. PPO with Calibrated Reward Calculation (PPO-C): This method adjusts the reward score during PPO training based on the difference between the current reward and a moving average of past rewards, scaled by the confidence the model verbalizes. This dynamic adjustment helps prevent the model from overvaluing high-confidence responses (a rough sketch of this adjustment follows the list).
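
To make the PPO-C idea concrete, here is a minimal sketch of one plausible form of the adjustment, assuming the verbalized confidence (rescaled to [0, 1]) scales the gap between the raw reward and a running mean of past rewards; the paper’s exact formula and hyperparameters may differ.

```python
# Rough sketch of a PPO-C style calibrated reward calculation (one plausible
# form, not necessarily the paper's exact rule). The verbalized confidence
# scales the gap between the raw reward and a running mean of past rewards:
# a fully confident response keeps its raw reward, an unsure response is
# pulled toward the mean.

class CalibratedReward:
    def __init__(self, alpha: float = 0.99):
        self.alpha = alpha        # smoothing factor for the moving average
        self.running_mean = 0.0   # moving average of raw rewards seen so far

    def __call__(self, raw_reward: float, confidence: float) -> float:
        """raw_reward: reward-model score; confidence: verbalized confidence rescaled to [0, 1]."""
        gap = raw_reward - self.running_mean
        calibrated = self.running_mean + confidence * gap
        # Update the moving average with the raw (unadjusted) reward.
        self.running_mean = self.alpha * self.running_mean + (1 - self.alpha) * raw_reward
        return calibrated


calib = CalibratedReward()
print(calib(raw_reward=0.8, confidence=1.0))   # confident, above average: reward stays high
print(calib(raw_reward=-0.5, confidence=1.0))  # confident, below average: reward stays low
print(calib(raw_reward=-0.5, confidence=0.2))  # unsure, below average: pulled toward the mean
```

With this form, expressing high confidence pays off only when the response scores above the running average, which mirrors the incentive PPO-C aims to create.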

The researchers evaluated their methods on Llama3-8B and Mistral-7B across a range of benchmarks covering question answering, open-ended generation, and multiple-choice tasks. The results showed that both PPO-M and PPO-C significantly reduced calibration error while maintaining high task performance.
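
“Calibration error” here measures the mismatch between stated confidence and actual accuracy. The snippet below computes a standard binned Expected Calibration Error (ECE) over (confidence, correctness) pairs; the binning scheme is a common convention and not necessarily the exact protocol used in the paper.

```python
from typing import Sequence

def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: per-bin |accuracy - average confidence|, weighted by bin size."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidences assumed in [0, 1]
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# An overconfident model: always reports 1.0 but is right only 60% of the time.
print(expected_calibration_error([1.0] * 10, [True] * 6 + [False] * 4))  # 0.4
```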

The paper’s findings provide valuable insights into the role of RLHF in shaping LLM confidence and offer promising solutions for improving LLM calibration. These methods could pave the way for the development of more reliable and trustworthy LLMs for a wider range of real-world applications.

To illustrate these methods, consider the following scenario:

Imagine you ask an LLM “What is the capital of France?” and it answers “Paris.” A poorly calibrated LLM reports a confidence of 10 out of 10 for nearly every answer it gives, right or wrong. A well-calibrated LLM’s verbalized confidence instead tracks how likely its answer is to be correct: close to 10 for a question it almost certainly gets right, and lower, say 7, for one it might be getting wrong.

By incorporating confidence scores into reward model training, PPO-M helps the LLM learn to pair high confidence with responses that are likely to be accurate and lower confidence with responses it is less sure about. PPO-C, on the other hand, adjusts the reward score during training so that high verbalized confidence is rewarded only when the underlying response is actually good.
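
As a rough illustration of the PPO-M data recipe, the sketch below augments a single preference example with confidence-aware pairs, preferring high confidence on the good response and low confidence on the bad one. The prompt templates, score scale, and field names are illustrative assumptions rather than the paper’s actual format.

```python
# Hypothetical confidence-augmented preference pairs in the spirit of PPO-M.
# Field names, score scale, and templates are illustrative assumptions.

def confidence_pairs(prompt: str, chosen: str, rejected: str) -> list[dict]:
    """Turn one preference example into confidence-aware reward-model training pairs.

    Intuition: for a good response, prefer the high-confidence phrasing;
    for a bad response, prefer the low-confidence phrasing. The reward model
    then cannot learn to reward confident wording by itself.
    """
    hi, lo = "Confidence: 9/10.", "Confidence: 3/10."
    return [
        {"prompt": prompt, "chosen": f"{chosen} {hi}", "rejected": f"{chosen} {lo}"},
        {"prompt": prompt, "chosen": f"{rejected} {lo}", "rejected": f"{rejected} {hi}"},
    ]

for pair in confidence_pairs(
    prompt="What is the capital of France?",
    chosen="The capital of France is Paris.",
    rejected="The capital of France is Lyon.",
):
    print(pair)
```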

The researchers’ work highlights the importance of understanding and addressing overconfidence in LLMs, particularly those trained using RLHF. By incorporating confidence scores into the training process, we can move closer to developing LLMs that are not only highly capable, but also reliable and trustworthy.