
Segmenting Text for Improved Reinforcement Learning in Language Models

Reinforcement learning from human feedback (RLHF) is a powerful technique for aligning large language models (LLMs) with human preferences. However, traditional RLHF methods often suffer from the “sparse reward” problem, where feedback is only provided at the end of text generation, making it difficult to optimize the model’s behavior at each step. A new paper, “Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model,” tackles this issue by proposing a novel segment-level reward model.

Instead of assigning rewards to individual tokens (as in some recent dense, token-level RLHF approaches) or to the entire text sequence (as in traditional bandit RLHF), the researchers assign rewards to semantically meaningful segments of text. A segment might be a short phrase, or even a single word, that forms a complete semantic unit. This approach captures the sequential nature of language generation while avoiding the excessive granularity of token-level rewards.

The key idea is that a segment represents a coherent chunk of meaning, making it a more appropriate unit for reward assignment than a single token (which often has little independent meaning) or the entire sequence (which provides delayed feedback). For example, in the phrase “The quick brown fox jumps over the lazy dog,” “The quick brown fox” could be considered one segment and “jumps over the lazy dog” another, each receiving its own reward. This is preferable to rewarding each word individually or only after the entire sentence is completed.
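To make the contrast concrete, here is a toy Python sketch of the three reward granularities; the segment boundaries and reward values are purely illustrative and not taken from the paper.

```python
# Illustrative only: the boundaries and numbers below are made up.

response = "The quick brown fox jumps over the lazy dog"

# Bandit (sequence-level) RLHF: one reward for the whole response.
sequence_reward = 0.8

# Token-level RLHF: one reward per token (dense, but each token
# carries little independent meaning).
tokens = response.split()
token_rewards = [0.1, 0.05, 0.0, 0.2, 0.15, 0.05, 0.0, 0.05, 0.2]

# Segment-level RLHF: one reward per semantically coherent chunk.
segment_rewards = [
    ("The quick brown fox", 0.4),
    ("jumps over the lazy dog", 0.4),
]
```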

The paper introduces a method for dynamic text segmentation based on the entropy of the language model's predictions. Essentially, segments are cut at points where the model's uncertainty about the next token rises sharply, indicating a likely boundary between distinct semantic units. This encourages the segments to align with natural language structures.
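The sketch below shows one way such entropy-based cutting could work, using PyTorch. The fixed entropy threshold and the "cut whenever entropy exceeds it" rule are simplifying assumptions for illustration; the paper's exact criterion may differ.

```python
import torch
import torch.nn.functional as F

def entropy_based_segments(logits: torch.Tensor, threshold: float = 2.0):
    """Split a generated sequence into segments at high-entropy positions.

    logits: [seq_len, vocab_size] next-token logits produced by the policy
            while generating the response.
    Returns a list of (start, end) token-index spans, end exclusive.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)  # [seq_len]

    segments, start = [], 0
    for t in range(1, logits.size(0)):
        # High predictive uncertainty at position t suggests the preceding
        # tokens formed a complete semantic unit, so close the segment here.
        if entropy[t] > threshold:
            segments.append((start, t))
            start = t
    segments.append((start, logits.size(0)))
    return segments
```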

To further improve the effectiveness of RL-based training, the researchers introduce a location-aware reward normalizer. Traditional RLHF normalizes reward signals with a single global mean and standard deviation, which can introduce bias and instability into PPO optimization. The new approach instead fits a linear regression model that predicts the mean and standard deviation of rewards as a function of a segment's relative location (as a percentage) within the text sequence. This accounts for positional effects: segments near the beginning of a response, for example, depend heavily on the prompt and contribute less to continuing the text's meaning than segments in the middle, so their reward statistics differ systematically.
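A rough sketch of such a normalizer is shown below, assuming the position-dependent mean and variance are fit by ordinary least squares over a batch of collected segment rewards; the paper's exact fitting procedure is not reproduced here.

```python
import numpy as np

class LocationAwareNormalizer:
    """Normalize segment rewards with position-dependent statistics.

    loc is the segment's relative position in the response
    (0.0 = start of the text, 1.0 = end).
    """

    def fit(self, locations, rewards):
        locations = np.asarray(locations, dtype=float)
        rewards = np.asarray(rewards, dtype=float)
        X = np.stack([locations, np.ones_like(locations)], axis=1)
        # Linear model for the location-dependent mean reward.
        self.mean_coef, *_ = np.linalg.lstsq(X, rewards, rcond=None)
        # Linear model for the location-dependent variance,
        # fit on squared residuals (an assumption of this sketch).
        resid2 = (rewards - X @ self.mean_coef) ** 2
        self.var_coef, *_ = np.linalg.lstsq(X, resid2, rcond=None)
        return self

    def normalize(self, location, reward):
        mean = self.mean_coef[0] * location + self.mean_coef[1]
        var = max(self.var_coef[0] * location + self.var_coef[1], 1e-6)
        return (reward - mean) / np.sqrt(var)
```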

Finally, the researchers interpolate segment-level rewards to individual tokens, creating an even denser training signal. This process helps to refine the model’s learning by providing feedback at a more granular level.
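A minimal sketch of this densification step, assuming each token simply inherits its segment's (normalized) reward; the paper's interpolation scheme may distribute rewards within a segment differently.

```python
def token_rewards_from_segments(segments, segment_rewards, seq_len):
    """Densify segment-level rewards into a per-token reward vector.

    segments:        list of (start, end) token spans, end exclusive.
    segment_rewards: one (normalized) reward per segment.
    """
    token_rewards = [0.0] * seq_len
    for (start, end), reward in zip(segments, segment_rewards):
        for t in range(start, end):
            token_rewards[t] = reward
    return token_rewards
```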

The method is evaluated on three popular benchmarks: AlpacaEval 2.0, Arena-Hard, and MT-Bench. The results show that the segment-level reward model outperforms both traditional bandit RLHF and recent token-level approaches across the reported metrics, including accuracy and response length. Ablation studies further support the design choices, including the benefits of average-based reward aggregation and location-aware reward normalization.

This work offers a promising advancement in RLHF: a more nuanced approach to reward assignment, one that respects the sequential and semantic structure of language, can substantially improve the effectiveness of training LLMs and lead to better-aligned models. The techniques presented are a meaningful step toward addressing the long-standing sparse-reward problem in sequential decision-making tasks such as language generation.