Best-of-N Distillation: Aligning LLMs with Human Feedback
Large language models (LLMs) are revolutionizing the way we interact with technology. These models are trained in three stages. First, they are pre-trained on vast amounts of text data. Second, they are fine-tuned to follow instructions via supervised fine-tuning. Finally, reinforcement learning from human feedback (RLHF) is used to further improve the quality of generations.
A surprisingly simple yet effective strategy to improve LLM generations is Best-of-N sampling: generate N outputs for a given prompt and keep the one with the highest reward. While this method achieves excellent results, it comes at a high computational cost, since it requires generating N candidates and scoring each one, roughly N times the inference cost of a standard model.
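As a minimal sketch of the baseline, here is what Best-of-N sampling looks like in code. The `generate` and `reward_model` callables are illustrative assumptions, not APIs from the paper:

```python
def best_of_n(prompt, generate, reward_model, n=16):
    """Sample n candidate completions and return the highest-reward one.

    `generate(prompt)` and `reward_model(prompt, output)` are hypothetical
    helpers standing in for an LLM sampler and a learned reward model.
    """
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```

The cost problem is visible directly: every call pays for n generations and n reward evaluations to produce a single response.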
A new paper from Google DeepMind proposes a novel approach, BOND (Best-of-N Distillation), that aims to emulate Best-of-N sampling without its computational overhead. BOND trains a policy so that a single sample approaches the quality of a Best-of-N selection.
Instead of generating N outputs and selecting the best one at inference time, BOND uses distribution matching: it trains the policy so that its output distribution moves closer to the distribution induced by Best-of-N sampling. By doing so, the model learns to generate outputs that resemble the best of N candidates, even though it only ever produces a single sample.
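To make the distribution-matching idea concrete, here is a toy, self-contained sketch and not the paper's implementation: for a small discrete set of candidate outputs with distinct rewards, the Best-of-N distribution can be computed exactly, and a KL divergence measures how far the current policy is from it (the paper's actual objective is richer and combines both KL directions). The uniform reference policy and the function names here are illustrative assumptions:

```python
import numpy as np

def best_of_n_distribution(p_ref, rewards, n):
    """Probability that each output is the reward-argmax of n i.i.d. draws
    from p_ref (rewards assumed distinct)."""
    order = np.argsort(rewards)
    p_sorted = p_ref[order]
    cdf = np.cumsum(p_sorted)                 # P(reward <= r(y))
    cdf_strict = cdf - p_sorted               # P(reward <  r(y))
    p_bon_sorted = cdf**n - cdf_strict**n     # P(max of n draws lands on y)
    p_bon = np.empty_like(p_bon_sorted)
    p_bon[order] = p_bon_sorted
    return p_bon

def forward_kl(target, policy, eps=1e-12):
    """KL(target || policy): one direction of the distribution-matching gap
    that BOND-style training would drive toward zero."""
    return float(np.sum(target * (np.log(target + eps) - np.log(policy + eps))))

# Example: 4 candidate outputs under a uniform reference policy.
p_ref = np.array([0.25, 0.25, 0.25, 0.25])
rewards = np.array([0.1, 0.7, 0.3, 0.9])
p_bon = best_of_n_distribution(p_ref, rewards, n=4)
print(p_bon)                     # mass concentrates on the high-reward outputs
print(forward_kl(p_bon, p_ref))  # gap a trained policy would need to close
```

The printed Best-of-N distribution shows the essential effect: probability mass shifts sharply toward high-reward outputs, and a policy trained to match it inherits that shift with a single forward pass.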
To demonstrate its effectiveness, the researchers tested BOND on two tasks: abstractive summarization and dialogue generation. In both cases, BOND outperformed other RLHF algorithms, achieving better results on various benchmarks.
Here are some concrete examples to illustrate the main ideas:
- Abstractive Summarization: Imagine you want to summarize a news article about a new scientific discovery. A standard RLHF model might produce a summary that is accurate but lacks clarity or conciseness. BOND, on the other hand, learns to generate summaries that are both accurate and engaging, mimicking the quality of the Best-of-N approach but with a single sample.
- Dialogue Generation: Let’s say you are building a chatbot that needs to respond to user queries in a conversational manner. A regular RLHF model might struggle to produce responses that are both informative and engaging. BOND learns to generate responses that are more natural and engaging by matching its output distribution to the Best-of-N output distribution.
Overall, BOND represents a significant advance in RLHF techniques. It offers a powerful way to improve LLM performance while reducing computational costs. By aligning LLMs with human feedback more effectively, BOND paves the way for safer and more reliable LLMs.