Implicit Model Fusion: A New Technique for Combining Large Language Models
A team of researchers from Sun Yat-sen University in China has developed a novel technique for combining the strengths of multiple large language models (LLMs) without the computational overhead of traditional methods. Their approach, detailed in a new preprint, uses “Weighted-Reward Preference Optimization” (WRPO) to implicitly fuse the capabilities of different open-source LLMs into a single, more powerful model.
Existing methods for combining LLMs face significant challenges. Ensemble methods, which aggregate predictions from multiple models, are computationally expensive because all models must be active during inference. Mixture-of-Experts (MoE) models, while more efficient, still require substantial memory. Model merging, which combines models with identical architectures, is limited in its applicability. Explicit Model Fusion (EMF), which leverages knowledge distillation, requires complex vocabulary alignment and distribution merging, potentially introducing noise and errors.
WRPO offers a different approach. Instead of explicitly combining models, it implicitly transfers capabilities by training a target LLM to prefer responses similar to those generated by stronger source LLMs. This is achieved through a preference optimization process guided by a reward model. The key innovation is a progressive adaptation strategy: the system initially relies on preferred responses from the target model itself, gradually shifting to responses generated by the source models. This smooths out distributional differences between the source and target models, improving the effectiveness of the preference learning.
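To make the optimization concrete, the sketch below shows a DPO-style objective in which the implicit reward of the preferred response is a weighted mix of a source-model response and the target model's own preferred response, with the fusion weight alpha shifting toward the source side over training. The function name, the linear mixing, and the beta value are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def wrpo_loss(policy_logp_src_w, policy_logp_tgt_w, policy_logp_l,
              ref_logp_src_w, ref_logp_tgt_w, ref_logp_l,
              alpha, beta=0.1):
    """DPO-style preference loss with a weighted 'preferred' reward (illustrative sketch).

    Each argument is a batch of summed per-sequence log-probabilities:
    *_src_w -> preferred response from a stronger source LLM
    *_tgt_w -> preferred response from the target LLM itself
    *_l     -> dispreferred response from the target LLM
    'policy' terms come from the model being trained, 'ref' from a frozen reference.
    beta=0.1 and the linear mix are assumed values, not taken from the paper.
    """
    # Implicit rewards, as in DPO: beta * log(pi_theta / pi_ref)
    r_src_w = beta * (policy_logp_src_w - ref_logp_src_w)
    r_tgt_w = beta * (policy_logp_tgt_w - ref_logp_tgt_w)
    r_l = beta * (policy_logp_l - ref_logp_l)

    # Progressive adaptation: alpha near 0 early (lean on the target's own preferred
    # responses), annealed toward 1 later (lean on the source models' responses).
    r_w = alpha * r_src_w + (1.0 - alpha) * r_tgt_w

    # Standard Bradley-Terry preference loss on the weighted margin.
    return -F.logsigmoid(r_w - r_l).mean()

# Example with dummy log-probabilities for a batch of 4 prompts:
loss = wrpo_loss(*[torch.randn(4) for _ in range(6)], alpha=0.3)
```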
For example, imagine using LLaMA3-8B-Instruct as the target model and several larger open-source LLMs (e.g., Mistral-Large-Instruct, Gemma2-27B) as source models. WRPO trains LLaMA3-8B-Instruct on pairs of responses: a preferred response, drawn at first from LLaMA3-8B-Instruct itself and increasingly from the source LLMs as training progresses, and a dispreferred response from LLaMA3-8B-Instruct, with a reward model scoring candidates to form the pairs. Over time, the target model learns to generate responses closer to those favored by the stronger source LLMs, while avoiding the complexity of explicitly merging vocabularies and aligning probability distributions.
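A plausible way to assemble those pairs, and to schedule the shift from target-generated to source-generated preferred responses, is sketched below; the `reward_model.score` interface and the linear annealing schedule are assumptions for illustration rather than details from the paper.

```python
def alpha_schedule(step, total_steps):
    """Fusion weight annealed from 0 (use the target's own preferred responses)
    to 1 (use the source models' responses); a linear ramp is an assumed choice."""
    return min(1.0, step / max(1, total_steps))

def build_preference_example(prompt, source_responses, target_responses, reward_model):
    """Hypothetical pair construction for one prompt. reward_model is assumed to
    expose a score(prompt, response) method; this is not an API from the paper."""
    def pick(responses, best=True):
        scored = [(reward_model.score(prompt, y), y) for y in responses]
        return (max if best else min)(scored, key=lambda t: t[0])[1]

    return {
        "prompt": prompt,
        "y_w_source": pick(source_responses, best=True),   # preferred, from a stronger source LLM
        "y_w_target": pick(target_responses, best=True),   # preferred, from the target itself
        "y_l": pick(target_responses, best=False),         # dispreferred, from the target
    }
```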
The researchers tested WRPO on three benchmarks: MT-Bench, AlpacaEval-2, and Arena-Hard. The results consistently show that WRPO outperforms existing fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct, WRPO achieved a length-controlled win rate of 55.9% against GPT-4-Preview on AlpacaEval-2 and a win rate of 46.2% against GPT-4-0314 on Arena-Hard. These results demonstrate that WRPO can leverage the capabilities of multiple LLMs without the computational burden of traditional approaches.
The authors also conducted ablation studies to examine the impact of different components of WRPO, such as the progressive adaptation strategy and the weighting of internal rewards. These studies confirmed the importance of each component in achieving optimal performance. Finally, they investigated the scalability of WRPO by varying the number of source LLMs, finding that performance generally improves as the number of source models increases.
WRPO presents a significant advance in the field of LLM fusion. Its implicit approach, combined with the progressive adaptation strategy, offers a computationally efficient and effective method for combining the strengths of multiple LLMs. This has potential implications for improving the capabilities of existing LLMs and reducing the resource requirements for training and deploying more powerful language models.