AI Learns to Reward Itself: Explainable Rewards and Bayesian Optimization for Language Models
A new paper from researchers at the University of Minnesota, Georgia Institute of Technology, Grammarly, and the University of Arizona proposes a method for improving reinforcement learning from human feedback (RLHF) in large language models (LLMs) by optimizing reward shaping. The technique, dubbed “Explainable Dense Reward Shapes via Bayesian Optimization,” aims to address the issue of sparse feedback signals in LLM training, where rewards are typically assigned only at the end of a sequence, making it difficult for the model to learn which intermediate steps were beneficial.
The core idea is to create a more granular, token-level reward system by leveraging explainability methods, such as SHAP and LIME, to estimate the contribution of each token to the overall reward. This dense reward signal is then shaped using Bayesian optimization, which learns the optimal weighting of different explainability scores to improve policy training.
How it Works: An Analogy
Imagine training a dog with treats. Instead of only giving a treat after the dog successfully completes a complex trick, you want to reward individual actions that contribute to the trick. For example, for the command “play dead,” you want to give treats for:
- lying down
- rolling over
- staying still
The problem becomes: how large a treat should each of these three actions earn? If the dog lies down but is still wiggling, should it get a full treat?
The researchers propose a similar token-level reward system for LLMs. Here, the “trick” is a generated response to a prompt (e.g., answering a question), and the individual actions are the generation of each token. Explainability methods (SHAP and LIME) are used to determine the importance of each token in relation to the overall reward (a judgement of how good the answer is).
Concretely: a language model generates the statement "Chili powder is a spice." The reward model judges how good the statement is (e.g., r(s,a) = +8.2). Explainability methods then estimate how much each token contributes to that reward, yielding an "explanation vector" such as (0.233, 0.295, 0.23, …, 0.138, 0.484, 0.113). The goal is to "redistribute" the sequence-level reward across these scores, creating a dense, token-level reward.
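Here is a minimal Python sketch of that redistribution step, assuming the per-token attribution scores have already been computed by SHAP or LIME (the numbers, tokens, and helper name below are illustrative, not values from the paper):

```python
import numpy as np

def redistribute_reward(seq_reward: float, attributions: np.ndarray) -> np.ndarray:
    """Spread one sequence-level reward over tokens in proportion to their attribution scores."""
    weights = np.clip(attributions, 0.0, None)   # ignore negative attributions for simplicity
    if weights.sum() == 0:                       # degenerate case: fall back to uniform credit
        weights = np.ones_like(attributions)
    weights = weights / weights.sum()            # normalize so token rewards sum to seq_reward
    return seq_reward * weights

# Illustrative example echoing the sentence above (attribution values are made up)
tokens = ["Chili", "powder", "is", "a", "spice"]
attributions = np.array([0.233, 0.295, 0.23, 0.138, 0.484])
dense_rewards = redistribute_reward(8.2, attributions)
for tok, r in zip(tokens, dense_rewards):
    print(f"{tok:>8s}: {r:+.3f}")
```

Because the token rewards are normalized to sum to the original sequence reward, the total return of the response is unchanged; only the credit assignment across tokens changes.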
Then Bayesian optimization comes in: it searches for the weighting of these token-importance scores that best shapes the reward function for policy training.
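As a rough illustration of that search, here is a sketch using scikit-optimize's gp_minimize (an assumption; the paper does not prescribe this library), with a deliberately simplified stand-in objective, since in the actual method the weights are evaluated by how well the shaped rewards support policy training:

```python
import numpy as np
from skopt import gp_minimize  # pip install scikit-optimize

rng = np.random.default_rng(0)
shap_scores = rng.random(5)    # placeholder per-token SHAP attributions
lime_scores = rng.random(5)    # placeholder per-token LIME attributions
seq_reward = 8.2               # sequence-level score from the reward model

def objective(weights):
    w_shap, w_lime = weights
    # Blend the two explanation signals, then redistribute the sequence reward.
    combined = w_shap * shap_scores + w_lime * lime_scores
    dense = seq_reward * combined / (combined.sum() + 1e-8)
    # Stand-in objective: prefer an informative (high-variance) dense signal whose
    # total stays close to the original reward. A real run would evaluate
    # downstream policy performance instead.
    return float(-dense.var() + abs(dense.sum() - seq_reward))

result = gp_minimize(objective, dimensions=[(0.0, 1.0), (0.0, 1.0)],
                     n_calls=20, random_state=0)
print("best (w_shap, w_lime):", result.x, "objective:", result.fun)
```

Bayesian optimization fits a surrogate model (here a Gaussian process) over the weight space and picks each new candidate by balancing exploration and exploitation, which is why it needs far fewer objective evaluations than an exhaustive grid search.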
Key Advantages:
- Faster Learning: By providing more frequent and granular feedback, the model can learn more efficiently.
- Improved Performance: Experiments show that this approach leads to performance improvements on downstream tasks compared to baseline methods.
- Policy Invariance: The researchers theoretically demonstrate that their reward-shaping method maintains the optimal policy of the original reward function, ensuring that the model learns the "right" thing (a sketch of why follows this list).
- More Stable Value Function Updates: Empirically, the method yields more stable value-function updates, suggesting a better approximation of the mapping from states to long-term returns.
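For intuition on the policy-invariance claim, here is a sketch of the standard return-equivalence argument (a simplification; the paper's formal statement may differ). If the shaped token-level rewards are constrained to redistribute the original sequence-level reward,

$$\sum_{t=1}^{T} \tilde{r}_t = r(s, a),$$

then every complete response earns exactly the same total return under the dense rewards as under the original sparse reward, so the ranking of policies by expected return, and with it the optimal policy, is unchanged.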
Experimental Results
The researchers evaluated their method on two dialogue datasets, HH-RLHF and UltraFeedback, and found that it consistently outperformed baselines in terms of reward scores and win rates against a baseline SFT model. They also demonstrated that their approach is less prone to reward overfitting, a common problem in RLHF.
Future Directions
The authors suggest that future research could focus on incorporating contextual information into the Bayesian optimization process to dynamically shape token-level rewards, potentially by learning weights based on token embeddings. This would allow the reward shaping to adapt to the specific context of the generated sequence, leading to even better performance.