AI Papers Reader

Personalized digests of latest AI research

SALSA: A New Approach for Better Aligning Language Models with Human Preferences

Large language models (LLMs) have made significant progress on natural language processing (NLP) tasks, but aligning them with human values and preferences remains a challenge. Reinforcement learning from human feedback (RLHF) is a promising technique for addressing this issue, but traditional RLHF methods constrain the policy to stay close to a single supervised fine-tuned reference model, typically via a KL-divergence penalty, which can limit exploration of the reward landscape and lead to suboptimal alignment.
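As a minimal sketch of that constraint (assuming PyTorch tensors of per-token log-probabilities; the function name, tensor shapes, and `beta` value are illustrative, not taken from the paper), a PPO-style RLHF loop shapes the reward-model score with a penalty for drifting away from the reference model:

```python
import torch

def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Shape a reward-model score with a KL penalty toward a reference model.

    reward:          [batch] scalar scores from the reward model
    policy_logprobs: [batch, seq_len] per-token log-probs under the policy
    ref_logprobs:    [batch, seq_len] per-token log-probs under the reference
    beta:            strength of the KL penalty (illustrative value)
    """
    # The per-token log-prob gap is a standard estimator of KL(policy || reference).
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # A larger beta keeps the policy closer to the reference, limiting exploration.
    return reward - beta * kl_estimate
```

With a single, rigid reference model, any sizable deviation is penalized, which is exactly the limitation SALSA targets.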

A research paper presented at a NeurIPS 2024 workshop on fine-tuning in machine learning proposes a new approach called SALSA (Soup-based Alignment Learning for Stronger Adaptation) to overcome this limitation. SALSA replaces the single reference model in RLHF with a “model soup”, created by averaging the weights of multiple independently supervised fine-tuned (SFT) models. This soup-based reference allows for larger deviation in KL divergence, enabling the policy model to explore a broader and more promising region of the parameter space without sacrificing stability.
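A minimal sketch of the weight-averaging step (assuming PyTorch state dicts from checkpoints with identical architectures; the function name and file paths are illustrative, not from the paper's code):

```python
import torch

def make_model_soup(state_dicts):
    """Average the parameters of several independently fine-tuned SFT checkpoints.

    All checkpoints must share the same architecture and parameter names.
    The averaged state dict can then be loaded as the RLHF reference model.
    """
    soup = {}
    for name in state_dicts[0]:
        # Stack the same parameter from every checkpoint and take the elementwise mean.
        soup[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return soup

# Illustrative usage: average two SFT checkpoints into a soup-based reference.
# sft_a = torch.load("sft_run_a.pt")
# sft_b = torch.load("sft_run_b.pt")
# reference_state_dict = make_model_soup([sft_a, sft_b])
```

Because the soup sits at an average of several good SFT solutions, the paper argues that the KL penalty computed against it is less restrictive than a penalty against any single checkpoint, letting the policy move further without becoming unstable.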

The researchers found that SALSA consistently outperformed PPO, a widely used RLHF algorithm, across various benchmarks, including MT-Bench, Arena-Hard, and UltraFeedback. The gains were especially pronounced on out-of-distribution data, where the soup-based reference fostered deeper exploration and stronger alignment.

Here’s an intuitive example of how SALSA works:

Imagine you’re trying to train a model to write poetry. You could use a single SFT model as a reference, but it might be too rigid and limit the model’s creativity. Instead, you could create a model soup by averaging the weights of several SFT models that have been trained on different styles of poetry. This would give the model a more diverse understanding of what good poetry looks like, allowing it to explore a wider range of possibilities and potentially produce more unique and compelling poems.

The paper’s findings suggest that SALSA can be a valuable tool for aligning LLMs with human preferences and improving performance across a wide range of NLP tasks. It is also a promising development for AI safety and ethics, as better-aligned models are more likely to reflect human values and avoid harmful biases.

The researchers recommend further investigating model soups built from more than two SFT models. They also suggest exploring ways to mitigate the “KL-Hack”, a failure mode that can arise when SALSA is used with a small KL-divergence coefficient and that can lead the model to generate nonsensical outputs.