CrowdSelect: A New Method for Efficient Instruction Tuning of Language Models
Researchers at Huazhong University of Science and Technology and South China University of Technology have developed a novel method for selecting high-quality instruction data for training smaller, more efficient language models. Their approach, called CrowdSelect, leverages the “wisdom” of multiple large language models (LLMs) to identify valuable instruction-response pairs, significantly outperforming existing methods. The work is detailed in a recent preprint paper, “CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom.”
Current methods for instruction tuning often rely on single, simplistic metrics such as reward scores or model perplexity to select training data. These metrics are limited in their ability to capture the complexity and nuances inherent in diverse instructions and responses. CrowdSelect addresses this limitation by incorporating three new foundation metrics:
- Difficulty: This metric assesses how challenging an instruction is for multiple LLMs. A high difficulty score marks an instruction that most LLMs struggle with, making it a valuable addition to the training data. For example, a complex reasoning problem might receive a high difficulty score, indicating its usefulness for training.
- Separability: This metric quantifies the variance in response quality across different LLMs for a given instruction. High separability means that different LLMs produce responses of substantially different quality to the same prompt, so the instruction is particularly effective at distinguishing the strengths and weaknesses of different models. For instance, an ambiguous instruction could yield high separability, as some LLMs might understand it while others do not.
- Stability: This metric evaluates how consistently LLM performance on an instruction aligns with the expected ranking, that is, whether larger models consistently perform better than smaller ones. High stability indicates that responses to the instruction reliably follow the expected size-based ranking, so the selected data carries well-grounded alignment signals.
CrowdSelect integrates these three metrics using a clustering-based approach, aiming to maintain a diverse set of selected instructions while prioritizing those that score highly across all three metrics. The authors argue that this multi-faceted approach gives a more comprehensive assessment of the quality of instruction-response pairs than methods that rely on a single metric or on predefined rules.
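One way to picture the integration step is sketched below. This is a hedged illustration rather than the procedure from the paper: it assumes every instruction already carries a cluster label (for instance from clustering instruction embeddings) and its three metric values, standardizes each metric, combines them with weights, and keeps the top-scoring instructions in every cluster so that no cluster is dropped. The helper names, the equal default weights, and the z-score normalization are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev

METRIC_NAMES = ("difficulty", "separability", "stability")


def zscores(values):
    """Standardize values; a constant column maps to all zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def select_per_cluster(items, per_cluster, weights=(1.0, 1.0, 1.0)):
    """Pick the top `per_cluster` items from each cluster by combined score.

    `items` is a list of dicts with keys "id", "cluster", and the three
    metric names. The weights are hypothetical tuning knobs.
    """
    # Standardize each metric so that no single scale dominates the sum.
    columns = [zscores([item[name] for item in items]) for name in METRIC_NAMES]
    combined = [
        sum(w * col[i] for w, col in zip(weights, columns))
        for i in range(len(items))
    ]

    # Group items by cluster so that every cluster contributes to the output.
    by_cluster = defaultdict(list)
    for item, score in zip(items, combined):
        by_cluster[item["cluster"]].append((score, item["id"]))

    selected = []
    for cluster in sorted(by_cluster):
        ranked = sorted(by_cluster[cluster], key=lambda pair: pair[0], reverse=True)
        selected.extend(item_id for _, item_id in ranked[:per_cluster])
    return selected


# Hypothetical pool of four instructions in two clusters.
pool = [
    {"id": "a", "cluster": 0, "difficulty": 1.0, "separability": 1.0, "stability": 1.0},
    {"id": "b", "cluster": 0, "difficulty": 0.0, "separability": 0.0, "stability": 0.0},
    {"id": "c", "cluster": 1, "difficulty": 0.5, "separability": 0.5, "stability": 0.5},
    {"id": "d", "cluster": 1, "difficulty": 0.2, "separability": 0.2, "stability": 0.2},
]
chosen = select_per_cluster(pool, per_cluster=1)
```

Selecting within clusters rather than globally is what preserves diversity here: a purely global top-k could draw every instruction from one dense region of the data.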
Experimental results demonstrate the effectiveness of CrowdSelect. The researchers tested their approach on two commonly used benchmarks: MT-bench and Arena-Hard, using four different smaller language models. Across these models, CrowdSelect achieved state-of-the-art performance, yielding improvements of up to 11.1% on MT-bench and 4.81% on Arena-Hard compared to other data selection techniques. These improvements demonstrate the significant advantages of incorporating multi-LLM wisdom for enhancing instruction tuning.
While promising, the authors acknowledge some limitations. Their work primarily relies on pre-collected data and the use of multiple LLMs and reward models, each introducing potential biases that require further investigation. Nevertheless, this research presents a substantial advancement in instruction tuning data selection and demonstrates the potential of utilizing multi-LLM wisdom to improve the efficiency and effectiveness of training smaller language models. The code for CrowdSelect is publicly available on GitHub.