
Best-of-N Distillation: Aligning LLMs with Human Feedback

Large language models (LLMs) are revolutionizing the way we interact with technology. These models are typically trained in three stages. First, they are pre-trained on vast amounts of text data. Second, they are fine-tuned to follow instructions via supervised fine-tuning. Finally, reinforcement learning from human feedback (RLHF) is used to further improve the quality of their generations.

A surprisingly simple yet effective strategy for improving LLM generations is Best-of-N sampling: generate N candidate outputs for a given prompt and keep the one the reward model scores highest. While this method achieves excellent results, it comes at a steep computational cost, since every query requires generating and scoring N candidates instead of one.
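As a rough illustration (not code from the paper), here is a minimal Python sketch of Best-of-N sampling; `generate` and `reward` are hypothetical stand-ins for an LLM sampling call and a reward model.

```python
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: draws one sample from the policy
    reward: Callable[[str, str], float],  # hypothetical: scores a (prompt, response) pair
    n: int = 16,
) -> str:
    """Draw n candidate responses and return the one with the highest reward.

    Inference cost grows linearly with n: n generations plus n reward-model calls.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    return candidates[scores.index(max(scores))]
```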

A new paper from Google DeepMind proposes BOND (Best-of-N Distillation), a novel approach that aims to emulate Best-of-N sampling without its inference-time overhead. BOND trains a policy whose single samples mimic the outputs that Best-of-N sampling would have selected.

Instead of generating N outputs and selecting the best one at inference time, BOND relies on distribution matching: it trains the model so that the distribution of its outputs moves toward the distribution of Best-of-N outputs. In doing so, the model learns to produce Best-of-N-quality generations from a single sample.
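To make the target distribution concrete, here is the standard order-statistics expression for the Best-of-N distribution, written in our own notation (not necessarily the paper's) and under the simplifying assumption that no two responses receive exactly the same reward:

$$
\pi_{\mathrm{BoN}}(y \mid x) \;=\; F_{\le}(y \mid x)^{N} - F_{<}(y \mid x)^{N},
\qquad
F_{\le}(y \mid x) \;=\; \Pr_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\big[\, r(x, y') \le r(x, y) \,\big],
$$

where π_ref is the reference policy, r is the reward model, and F_< is the strict-inequality analogue of F_≤. Because F_≤ − F_< equals π_ref(y | x), for rarely sampled y this is roughly N · π_ref(y | x) · F_≤(y | x)^(N−1): the reference probability reweighted toward high-reward outputs. Distillation then minimizes a divergence between the trained policy π_θ and π_BoN, so that one sample from π_θ behaves like the best of N samples from π_ref.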

To demonstrate its effectiveness, the researchers tested BOND on two tasks: abstractive summarization and dialogue generation. In both cases, BOND outperformed other RLHF algorithms on the benchmarks the authors report.

The following toy example makes the main ideas concrete:
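The script below (all numbers invented for illustration) computes the exact Best-of-N distribution over four candidate responses and evaluates a forward-KL distribution-matching loss against it. The paper's method operates on full LLM policies rather than a four-way toy, but the way probability mass is reshaped toward high-reward outputs is the same.

```python
import math

# Toy setup: four possible responses to a single prompt, with made-up
# reference-policy probabilities and reward-model scores.
responses = ["A", "B", "C", "D"]
p_ref = [0.40, 0.30, 0.20, 0.10]
reward = [0.1, 0.5, 0.9, 0.3]
N = 4  # number of samples drawn by Best-of-N


def bon_probability(i: int) -> float:
    """P(response i wins Best-of-N) = F_le(i)^N - F_lt(i)^N (assumes no reward ties)."""
    f_le = sum(p for p, r in zip(p_ref, reward) if r <= reward[i])
    f_lt = sum(p for p, r in zip(p_ref, reward) if r < reward[i])
    return f_le ** N - f_lt ** N


p_bon = [bon_probability(i) for i in range(len(responses))]

for resp, p0, p1 in zip(responses, p_ref, p_bon):
    print(f"{resp}: reference {p0:.2f} -> Best-of-{N} {p1:.3f}")


def forward_kl(p_theta) -> float:
    """KL(p_bon || p_theta): one possible distribution-matching objective."""
    return sum(q * math.log(q / p) for q, p in zip(p_bon, p_theta) if q > 0)


print("loss for the untrained reference policy:", round(forward_kl(p_ref), 3))
print("loss for a perfectly distilled policy:  ", round(forward_kl(p_bon), 3))  # 0.0
```

Running it shows the mass on the low-reward response A collapsing from 0.40 to about 0.03, while the highest-reward response C rises from 0.20 to about 0.59; a distilled policy is trained to place its probability mass the same way.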

Overall, BOND represents a significant advance in RLHF techniques, offering a way to improve LLM output quality without the inference-time cost of Best-of-N sampling. By aligning LLMs with human feedback more effectively, BOND paves the way for safer and more reliable models.