Language Models Reveal Their Uncertainty: A New Approach to Confidence Estimation
Language models (LMs) like GPT-4 are powerful tools, but they’re not perfect. Sometimes, they confidently produce incorrect answers. To help users discern reliable from unreliable outputs, researchers are developing methods for LMs to accurately assess their own confidence. A new paper, “Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences,” proposes a novel approach that significantly improves the accuracy of LM confidence estimations.
Traditional methods ask an LM to directly rate its confidence on a scale (e.g., 0-1). However, these “absolute confidence” methods often fail. LMs tend to give similar high confidence scores across the board, regardless of accuracy, limiting their usefulness. For example, a model might give a 0.9 confidence score to both a correctly answered question about simple arithmetic and a completely incorrect answer about complex astrophysics. This lacks discriminative power.
The researchers introduce “relative confidence estimation,” which side-steps the problem of absolute judgment. Instead of asking the model to rate its own certainty, they pit questions against each other. The model is presented with two questions and their respective answers and asked: “Which question are you more confident in answering correctly?”. This approach is akin to a head-to-head matchup in a tournament.
Imagine comparing two multiple-choice questions:
- Question A: What is 2 + 2?
- Question B: What is the capital of Kazakhstan?
A human would likely be far more confident in answering Question A. The proposed method asks the LM to make this comparison. By repeating this process across many question pairs, the researchers build up a comprehensive picture of the model’s confidence preferences. This data, reflecting the model’s relative judgments, is then fed into rank aggregation algorithms (like Elo rating, used in chess, or Bradley-Terry, used in sports analytics) to translate these preferences into numerical confidence scores for each individual question.
The researchers tested their method against five state-of-the-art LMs (GPT-4, GPT-40, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.1 405B) on 14 diverse question-answering datasets covering STEM, social science, and commonsense reasoning. They compared their relative confidence estimation with direct absolute confidence estimation (simply asking the model for a confidence score) and self-consistency approaches (repeatedly asking the model and averaging its confidence assessments).
Their results demonstrate that relative confidence estimation consistently outperforms the other approaches. On average, the method improved selective classification AUC (a metric that measures how well a model can choose when to answer and when to abstain) by 3.5% over direct absolute confidence estimation and by 1.7% over self-consistency methods. The gains were even more significant for some models and datasets; for example, relative confidence improved the selective classification AUC for Llama 3.1 405B by 6.1% compared to direct prompting.
This research provides a valuable step forward in reliable confidence estimation for LMs. By harnessing the model’s ability to make relative judgments, rather than forcing it into the difficult task of absolute self-assessment, it achieves more accurate and informative confidence scores that are better suited to helping users make informed decisions when interacting with these powerful, but fallible, AI systems. Future work will explore how to further refine the method to handle more complex tasks, like open-ended question answering and long-form text generation.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.