Large Language Models Can Now Learn to Correct Themselves
Large language models (LLMs) are becoming increasingly sophisticated, capable of generating text that is often hard to distinguish from human writing. However, aligning these models with human preferences remains a challenge: current methods typically rely on human annotation of large preference datasets, which is expensive and time-consuming.
Researchers at the Chinese Academy of Sciences have developed a new algorithm called Self-Steering Optimization (SSO), which aims to automate the alignment process. SSO requires no human annotation; instead, it automatically generates accurate, on-policy preference signals during training.
The core of SSO is its ability to continuously generate high-quality preference signals based on predefined principles, so the model can iteratively learn to distinguish better responses from worse ones without human intervention. It does this by pursuing two key objectives, illustrated in the sketch after this list:
- Accuracy of preference signals: SSO aims to maintain a consistent gap between chosen and rejected responses while ensuring that both are relevant to the current model’s learning capacity. This is achieved by carefully constructing contrastive prompts and training the model to differentiate between them.
- On-policy behavior: SSO ensures that both the chosen and rejected responses are sampled from the model’s current policy, so the preference signals reflect what the model can actually produce right now and remain useful for guiding its further training.
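The sketch below illustrates the general idea of building annotation-free, on-policy preference pairs from contrastive prompts. It is a minimal sketch under stated assumptions, not the paper's actual implementation: the names (`build_contrastive_prompts`, `sample_preference_pair`, `model.generate`, the example principles) are hypothetical, and the real prompting templates and sampling setup in SSO may differ.

```python
# Hypothetical sketch of contrastive-prompt preference pair construction.
# Names and templates are illustrative, not the paper's actual code.

PRINCIPLES = [
    "Answer accurately and acknowledge uncertainty when unsure.",
    "Follow the user's instructions precisely.",
]

def build_contrastive_prompts(query: str, principle: str) -> tuple[str, str]:
    """Wrap the same query with a principle-following and a principle-violating instruction."""
    positive = f"{principle}\n\nUser: {query}"
    negative = f"Deliberately violate this rule: {principle}\n\nUser: {query}"
    return positive, negative

def sample_preference_pair(model, query: str, principle: str) -> dict:
    """Sample both responses from the *current* policy so the signal stays on-policy."""
    pos_prompt, neg_prompt = build_contrastive_prompts(query, principle)
    chosen = model.generate(pos_prompt)    # response steered toward the principle
    rejected = model.generate(neg_prompt)  # response steered away from it
    # The pair becomes a preference example with no human annotation involved.
    return {"prompt": query, "chosen": chosen, "rejected": rejected}
```

Because both responses come from the same current policy, the resulting pairs stay within the model's present capability, which is what keeps the preference signal learnable as training progresses.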
The researchers evaluated SSO on two foundation models, Qwen2 and Llama3.1, using various subjective and objective benchmarks. They found that SSO significantly improved the performance of the models on these benchmarks, even surpassing the performance of models trained on annotated data. In addition, they demonstrated that SSO could be used to generate accurate and on-policy preference data for training reward models.
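To make the reward-model use case concrete, the snippet below shows a standard pairwise (Bradley–Terry style) loss that such preference pairs could feed into. This is a generic illustration, not the paper's training objective; the `reward_model` callable mapping (prompt, response) text to a scalar score is an assumed interface.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, batch):
    """Pairwise loss: the chosen response should score higher than the rejected one.

    `reward_model(prompt, response)` is assumed to return a scalar score tensor;
    `batch` holds "prompt", "chosen", and "rejected" fields like the pairs above.
    """
    r_chosen = reward_model(batch["prompt"], batch["chosen"])
    r_rejected = reward_model(batch["prompt"], batch["rejected"])
    # Maximize the score margin between chosen and rejected responses.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```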
SSO offers several advantages over existing methods. It is scalable and resource-efficient because it does not require human annotation. It is also highly effective in generating accurate and learnable preference signals, which leads to significant improvements in model alignment.
The development of SSO represents a significant step forward in automated alignment of LLMs. It offers a promising approach for developing more capable and trustworthy AI systems that can better understand and respond to human needs.