Large Language Models Can Be Aligned Without Instruction Conditioning

Large language models (LLMs) are trained on massive amounts of text to predict the next word in a sequence. In doing so, they implicitly learn a wide range of tasks, like writing different kinds of creative text, summarizing factual content, or translating between languages. To make LLMs useful as chat assistants, they are typically fine-tuned on instruction-response pairs. This process, called instruction tuning (IT), gives the model explicit examples of how to respond to instructions, including how to handle unsafe queries. In a new paper, however, researchers propose Response Tuning (RT), which eliminates the instruction-conditioning step and relies solely on supervision over the response space. They found that RT models, trained only on responses, can effectively respond to a wide range of instructions and exhibit helpfulness comparable to that of their instruction-tuned counterparts.
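To make the distinction concrete, here is a minimal sketch (not the authors' released code) of how a single fine-tuning example might be built under each scheme. The chat markers and helper names are illustrative placeholders; the essential difference is that RT drops the instruction entirely and supervises the model on the response alone.

```python
# Illustrative sketch: building one training example for Instruction Tuning (IT)
# versus Response Tuning (RT). The <user>/<assistant> markers stand in for
# whatever chat template the underlying model actually uses.

def build_it_example(instruction: str, response: str) -> dict:
    """IT: the instruction is given as context; the response is supervised."""
    context = f"<user>\n{instruction}\n<assistant>\n"
    return {"input_text": context + response, "supervised_span": response}

def build_rt_example(response: str) -> dict:
    """RT: no instruction at all; the model is supervised on the response alone."""
    context = "<assistant>\n"
    return {"input_text": context + response, "supervised_span": response}

if __name__ == "__main__":
    instruction = "Write a short story about a cat named Mittens."
    response = "Mittens curled up on the windowsill, dreaming of tuna..."
    print(build_it_example(instruction, response))
    print(build_rt_example(response))
```

In both cases the training loss would typically be computed only over the supervised span; RT simply removes the instruction tokens from the context.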

For example, a user might ask an LLM to write a short story about a cat named Mittens. An IT-trained model would learn to respond by generating a story that follows the instruction, because it saw similar instruction-response pairs during fine-tuning. An RT-trained model, having seen only responses during fine-tuning (many different short stories, for instance), would generalize from them to produce a story about Mittens when asked. The key idea is that pre-trained LLMs already have the capacity to follow instructions and generate diverse, coherent responses. By supervising only the response space, the researchers tap into this latent ability and train models that are as capable as those trained with the more traditional IT approach.

The researchers further demonstrate that controlling the response distribution during training can significantly improve user preference and elicit target behaviors, such as refusing assistance with unsafe queries. For example, RT models can be trained to decline tasks that are potentially unsafe or harmful simply by including in the training data a small set of responses that explicitly refuse such requests. The researchers found that this approach effectively teaches the model to identify and reject unsafe queries without needing specific instruction-response pairs.
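As a rough illustration of this idea, the sketch below mixes a small fraction of refusal responses into an RT-style training set. The refusal text, the 5% mixing ratio, and the helper names are assumptions made for the example, not details taken from the paper.

```python
# Illustrative sketch: controlling the training response distribution by
# mixing in a small share of explicit refusals. All texts and the 5% ratio
# below are placeholder assumptions, not values from the paper.
import random

helpful_responses = [
    "Here is a step-by-step guide to setting up a Python virtual environment...",
    "Mittens curled up on the windowsill, dreaming of tuna...",
]

refusal_responses = [
    "I can't help with that, because it could cause harm. "
    "If you have a safer goal in mind, I'm happy to assist.",
]

def build_rt_dataset(helpful, refusals, refusal_fraction=0.05, size=1000, seed=0):
    """Sample RT training responses so a small fraction are explicit refusals."""
    rng = random.Random(seed)
    examples = []
    for _ in range(size):
        pool = refusals if rng.random() < refusal_fraction else helpful
        examples.append({"input_text": "<assistant>\n" + rng.choice(pool)})
    return examples

train_data = build_rt_dataset(helpful_responses, refusal_responses)
```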

The results of the researchers’ experiments suggest that establishing an adequate output space is a crucial component of aligning LLMs with human needs. Their findings illuminate the potential of pre-trained LLMs and provide a powerful new approach to fine-tuning these models for real-world applications.