Self-Boosting Large Language Models with Synthetic Preference Data
Large Language Models (LLMs) have made significant progress in following user instructions and generating helpful and harmless responses. This progress has largely been driven by model alignment, a process of training reward models or LLMs directly on datasets curated from human preferences. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs.
A new paper introduces SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences.
Here’s how SynPO works:
- Self-Prompt Generator: SynPO first trains a self-prompt generator to create large-scale synthetic prompts. Unlike previous approaches that require more powerful LLMs and instruction examples, SynPO’s generator utilizes only the LLM itself and three random keywords as input.
- Response Improver: SynPO then treats model-generated responses as rejected candidates and employs a response improver to refine them into chosen ones. The improver is trained to identify and close distribution gaps between texts, exploiting the fact that refining an existing response is easier than generating a high-quality response from scratch.
- Synthetic Preference Data: Together, the self-prompt generator and response improver produce synthetic preference pairs used to train the LLM. The process is then repeated, with the LLM progressively improving its ability to generate high-quality responses (a minimal sketch of one iteration follows this list).
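
Conceptually, one SynPO iteration can be sketched as below. This is a minimal illustration rather than the authors' implementation: `generate_prompt`, `improve_response`, `preference_optimize`, and the `llm.generate` interface are hypothetical placeholders for the self-prompt generator, the response improver, a preference-optimization step (e.g., DPO-style) on the synthetic pairs, and a generation call, respectively.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins for the trained components; in the paper, the policy
# model itself is fine-tuned to play the generator and improver roles.

@dataclass
class PreferencePair:
    prompt: str    # synthetic prompt from the self-prompt generator
    chosen: str    # refined response produced by the response improver
    rejected: str  # the model's own initial response

def generate_prompt(llm, keywords: list[str]) -> str:
    """Self-prompt generator: turn three random keywords into an instruction."""
    raise NotImplementedError  # placeholder for a generation call

def improve_response(improver, prompt: str, draft: str) -> str:
    """Response improver: rewrite the model's draft into a higher-quality response."""
    raise NotImplementedError  # placeholder for a generation call

def preference_optimize(llm, pairs: list[PreferencePair]):
    """Preference optimization (e.g., a DPO-style update) on synthetic pairs."""
    raise NotImplementedError  # placeholder for a training step

def synpo_iteration(llm, improver, keyword_pool: list[str], n_pairs: int):
    pairs = []
    for _ in range(n_pairs):
        keywords = random.sample(keyword_pool, 3)              # three random keywords as the only input
        prompt = generate_prompt(llm, keywords)                # synthetic prompt
        rejected = llm.generate(prompt)                        # model's own output -> rejected candidate
        chosen = improve_response(improver, prompt, rejected)  # refined output -> chosen candidate
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return preference_optimize(llm, pairs)                     # updated model feeds the next iteration
```

The key property of the loop is that the model updated at the end of one iteration produces the rejected responses of the next, which is what makes the procedure self-boosting.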
To demonstrate the effectiveness of SynPO, the researchers conducted comprehensive experiments on both Mistral-Base 7B and Llama3-8B Base. After four SynPO iterations, Llama3-8B and Mistral-7B showed substantial gains in instruction following, with win-rate improvements of over 22.1% on AlpacaEval 2.0 and ArenaHard. SynPO also improved general performance across a variety of tasks, reflected in a 3.2 to 5.0 point increase in average score on the Open LLM Leaderboard.
The researchers also showed that SynPO generated more diverse prompts, with lower inter-prompt similarity than other methods, suggesting that its synthetic data covers a wider range of topics and intents.
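
For intuition, inter-prompt similarity can be estimated by embedding the synthetic prompts and averaging pairwise cosine similarities. The sketch below assumes the `sentence-transformers` package and an off-the-shelf embedding model; it is not the paper's exact diversity metric.

```python
# A minimal sketch of one way to measure inter-prompt similarity; the paper's
# exact metric may differ. Assumes the sentence-transformers package.
import numpy as np
from sentence_transformers import SentenceTransformer

def mean_pairwise_similarity(prompts: list[str]) -> float:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(prompts, normalize_embeddings=True)  # unit-norm embeddings
    sims = emb @ emb.T                                      # cosine similarity matrix
    n = len(prompts)
    off_diag = sims[~np.eye(n, dtype=bool)]                 # drop self-similarities
    return float(off_diag.mean())                           # lower mean = more diverse prompts
```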
This research suggests that SynPO is a promising approach to aligning LLMs with human preferences. By leveraging synthetic preference data, SynPO can greatly reduce the cost and effort of collecting high-quality preference data while still delivering significant performance gains. This approach has the potential to accelerate the development of even more powerful and capable LLMs.