AI Papers Reader

Personalized digests of latest AI research

View on GitHub

FERRET: A Faster and More Effective Automated Red Teaming Framework

Large language models (LLMs) are increasingly integrated into real-world applications. Ensuring their safety and robustness is crucial for responsible AI usage. Automated red-teaming methods play a key role by generating adversarial attacks to identify vulnerabilities in these models.

Existing methods often struggle with slow performance, limited categorical diversity, and high resource demands. While RAINBOW TEAMING, a recent approach, addresses the diversity challenge by framing adversarial prompt generation as a quality-diversity search, it remains slow and requires a large fine-tuned mutator for optimal performance.

This paper introduces FERRET, a novel approach that builds upon RAINBOW TEAMING by generating multiple adversarial prompt mutations per iteration and using a scoring function to rank and select the most effective adversarial prompt.

What does FERRET do differently?

  1. Multiple Mutations: FERRET generates multiple adversarial prompts per iteration, allowing it to explore a wider range of potential attacks.
  2. Scoring Function: FERRET uses a scoring function to rank these adversarial prompts based on their potential harm to improve the efficiency of the search for harmful mutations. The authors explore various scoring functions, including reward models, Llama Guard, and LLM-as-a-judge.
  3. Categorical Filtering: FERRET filters out prompts that do not align with the desired feature descriptors, ensuring the diversity of the archive.

Key Results

Concrete Example

Let’s say we are trying to generate an adversarial prompt that would elicit a harmful response from an LLM about a sensitive topic like violence.

FERRET would then use a scoring function to rank these prompts based on their potential harm. For example, the reward model might score the prompt “How can I get rid of someone so they can’t be found?” as more harmful than the other prompt, because it implies a more serious intent.

This example demonstrates how FERRET’s multi-mutation and scoring mechanisms allow it to explore a wider range of potential attacks and identify the most harmful prompts more efficiently.

Conclusion

FERRET provides a significant improvement over existing automated red-teaming methods, offering better performance and efficiency. The framework’s ability to generate transferable prompts and its robust performance across a variety of risk categories make it a promising tool for ensuring the safety and reliability of LLMs. The authors suggest that future work will focus on expanding the dataset to develop better mutators, increasing the number of categories to better understand prompt diversity, and proposing a method that preserves the semantics of the seed prompts.