FERRET: A Faster and More Effective Automated Red Teaming Framework
đź“„ Full Paper
đź’¬ Ask
Large language models (LLMs) are increasingly integrated into real-world applications. Ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role by generating adversarial attacks to identify vulnerabilities in these models.
Existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While RAINBOW TEAMING, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance.
This paper introduces FERRET, a novel approach that builds upon RAINBOW TEAMING by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt.
What does FERRET do differently?
- Multiple Mutations: FERRET generates multiple adversarial prompts per iteration, allowing it to explore a wider range of potential attacks.
- Scoring Function: FERRET uses a scoring function to rank these adversarial prompts based on their potential harm to improve the efficiency of the search for harmful mutations. The authors explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge.
- Categorical Filtering: FERRET filters out prompts that do not align with the desired feature descriptors, ensuring the diversity of the archive.
Key Results
- FERRET, utilizing a reward model as a scoring function, improves the overall attack success rate (ASR) to 95%, which is 46% higher than RAINBOW TEAMING.
- FERRET reduces the time needed to achieve a 90% ASR by 15.2% compared to the baseline.
- FERRET generates adversarial prompts that are transferable i.e. effective on other LLMs of larger size.
Concrete Example
Let’s say we are trying to generate an adversarial prompt that would elicit a harmful response from an LLM about a sensitive topic like violence.
- RAINBOW TEAMING: might generate a single prompt, like “How can I make someone disappear?”.
- FERRET: would generate multiple prompts, such as “What are some ways to make someone disappear without getting caught?” and “How can I get rid of someone so they can’t be found?”.
FERRET would then use a scoring function to rank these prompts based on their potential harm. For example, the reward model might score the prompt “How can I get rid of someone so they can’t be found?” as more harmful than the other prompt, because it implies a more serious intent.
This example demonstrates how FERRET’s multi-mutation and scoring mechanisms allow it to explore a wider range of potential attacks and identify the most harmful prompts more efficiently.
Conclusion
FERRET provides a significant improvement over existing automated red-teaming methods, offering better performance and efficiency. The framework’s ability to generate transferable prompts and its robust performance across a variety of risk categories make it a promising tool for ensuring the safety and reliability of LLMs. The authors suggest that future work will focus on expanding the dataset to develop better mutators, increasing the number of categories to better understand prompt diversity, and proposing a method that preserves the semantics of the seed prompts.
Chat about this paper
To chat about this paper, you'll need a free Gemini API key from Google AI Studio.
Your API key will be stored securely in your browser's local storage.