Large language models can now adapt their safety settings on the fly
đź“„ Full Paper
đź’¬ Ask
A new technique lets large language models adapt their safety settings on the fly, without needing to be retrained. The method, called Controllable Safety Alignment (CoSA), could help developers make models that are more appropriate for specific tasks and audiences.
Today’s large language models (LLMs) are often designed to be “safe” by refusing to interact with any content deemed unsafe by the model provider. But this approach has limitations. First, it lacks flexibility, as social norms vary across cultures and regions. Second, it doesn’t account for users with diverse safety needs.
For example, a model designed for general use might refuse to generate violent content, even if a game developer needs it to generate violent scenes for their video game. Retraining the model to allow violence would be expensive and time-consuming.
“The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider,” the researchers write in a new paper. “This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned.”
CoSA addresses this problem by allowing models to adapt their safety settings at inference time. It works by incorporating “safety configs”—free-form natural language descriptions of desired safety behaviors—into the system prompt.
For instance, a safety config for a game developer might specify that violence is allowed, while a safety config for a harassment training manager might specify that hateful language is not allowed.
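To make this concrete, here is a minimal Python sketch of what embedding a safety config in a system prompt could look like. The config wording, message format, and helper function are invented for illustration; the paper does not prescribe this exact layout.

```python
# Illustrative only: the config text and message layout below are hypothetical,
# not the exact format used in the paper.

# A free-form natural-language safety config, e.g. for a game studio.
GAME_STUDIO_CONFIG = (
    "You may describe fictional violence, combat, and gore when it serves "
    "the user's game-writing request. Do not produce hate speech, and do not "
    "give real-world instructions for weapons or violence."
)

def build_messages(safety_config: str, user_request: str) -> list[dict]:
    """Embed the safety config in the system prompt of a chat-style request."""
    system_prompt = (
        "You are a helpful assistant. Follow this safety configuration "
        f"exactly:\n{safety_config}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]

messages = build_messages(
    GAME_STUDIO_CONFIG,
    "Write a gritty battle scene for the opening level of our war game.",
)
print(messages[0]["content"])
```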
CoSA uses a technique called CoSAlign to align LLMs to these safety configs. CoSAlign first derives a “risk taxonomy”—a set of categories representing different types of risks—from training prompts. It then generates diverse synthetic data to build a preference dataset, which is used to fine-tune the model.
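The paper's full pipeline relies on LLM calls and judges for taxonomy derivation, response generation, and scoring, so the sketch below only mirrors its structure: the stub functions stand in for those calls, and the toy scoring logic is an assumption, not the authors' implementation.

```python
# Structural sketch of a CoSAlign-style preference-data pipeline.
# categorize_risk, generate_response, and judge_response are placeholder stubs
# standing in for LLM calls and judges used in the real method.
import random

def categorize_risk(prompt: str) -> str:
    """Stub: map a training prompt to a risk category (the 'risk taxonomy')."""
    keywords = {"fight": "violence", "insult": "hate", "poison": "illegal"}
    return next((cat for word, cat in keywords.items() if word in prompt), "benign")

def generate_response(prompt: str, style: str) -> str:
    """Stub: produce a candidate response (in practice, sampled from an LLM)."""
    return f"[{style} response to: {prompt}]"

def judge_response(response: str, config: str, risk: str) -> float:
    """Stub: score how well a response satisfies the given safety config."""
    allowed = risk in config or risk == "benign"
    is_compliant = "compliant" in response
    return 1.0 if allowed == is_compliant else 0.0

def build_preference_data(prompts, configs):
    """Pair prompts with configs, score candidates, keep (chosen, rejected) pairs."""
    data = []
    for prompt in prompts:
        config = random.choice(configs)
        risk = categorize_risk(prompt)
        candidates = [generate_response(prompt, s) for s in ("compliant", "refusal")]
        ranked = sorted(candidates,
                        key=lambda r: judge_response(r, config, risk),
                        reverse=True)
        data.append({"prompt": prompt, "config": config,
                     "chosen": ranked[0], "rejected": ranked[-1]})
    return data  # in the real method, this feeds a preference fine-tuning step

prompts = ["Describe a sword fight", "Write a friendly greeting"]
configs = ["violence allowed, hate disallowed", "no unsafe content allowed"]
print(build_preference_data(prompts, configs))
```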
The researchers show that CoSAlign significantly outperforms other methods for controlling model safety, including in-context alignment, where a model is provided with a few examples of desired behavior.
The paper also includes a new evaluation protocol that considers both helpfulness and configured safety, summarizing them into a single “CoSA-Score.” The researchers also develop CoSApien, a human-authored benchmark that consists of real-world LLM use cases with diverse safety requirements.
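The precise scoring rule is defined in the paper; as a rough illustration, the hypothetical function below combines per-response judgments of helpfulness and config adherence into a single average, rewarding helpful, compliant answers and penalizing config violations. This aggregation is an assumption and may differ from the paper's definition.

```python
# Hypothetical illustration of combining per-response judgments into one score.
# Assumes each response was judged for helpfulness and for adherence to its
# safety config; the exact aggregation in the paper may differ.
from dataclasses import dataclass

@dataclass
class JudgedResponse:
    helpful: bool      # did the response address the user's request?
    config_safe: bool  # did it stay within the given safety config?

def cosa_score(responses: list[JudgedResponse]) -> float:
    """Average per-example score: +1 helpful and safe, 0 unhelpful but safe, -1 unsafe."""
    def per_example(r: JudgedResponse) -> int:
        if not r.config_safe:
            return -1
        return 1 if r.helpful else 0
    return sum(per_example(r) for r in responses) / len(responses)

judged = [JudgedResponse(True, True), JudgedResponse(False, True), JudgedResponse(True, False)]
print(cosa_score(judged))  # (1 + 0 - 1) / 3 = 0.0
```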
Overall, CoSA is a promising new approach to safety alignment that could make LLMs more flexible and practical. It could help developers create models that are better suited to specific tasks and audiences, without sacrificing safety.
The researchers acknowledge that CoSA could also be used for malicious purposes, such as creating models that are more likely to generate harmful content. To address this risk, they suggest that CoSA should only be used by authorized users who are able to carefully review and modify safety configs. They also suggest that models should be designed to be more resistant to prompt injection attacks, which could be used to manipulate safety settings.
The paper was published on the preprint server arXiv, and the code and data are publicly available.