A Comprehensive Toolkit for Safety Evaluation of Large Language Models

💬 Ask

Large Language Models (LLMs) have become increasingly sophisticated, but their growing capabilities also raise safety concerns. Bias, harmful content generation, and even the potential for malicious use necessitate robust safety evaluation tools. Enter WALLEDEVAL, a new comprehensive AI safety testing toolkit developed by researchers at Walled AI Labs.

WALLEDEVAL is designed to address a wide range of safety concerns, including:

Harmful content generation: LLMs might produce unsafe content like instructions for building explosives or hate speech.
Exaggerated safety: Models might be overly cautious, refusing to answer even harmless questions. This can be problematic when users need factual information.
Multilingual safety: LLMs should be safe across different languages, as they are increasingly used in multilingual contexts.

The toolkit features several key aspects to enable thorough evaluation:

Open-Weight and API-Based Model Support: WALLEDEVAL can evaluate both open-weight models (like Llamas and Mistrals) and API-based models (like ChatGPT and Claude), offering flexibility for diverse LLM testing.
Comprehensive Safety Benchmarks: The toolkit features over 35 safety benchmarks covering various areas like multilingual safety, exaggerated safety, and prompt injection vulnerabilities. For instance, the “Aya Red-Teaming” dataset tests for potential biases and discrimination in responses, while the “XSTest” benchmark evaluates exaggerated safety behaviors.
Judge Support: WALLEDEVAL allows for the benchmarking of “judges”—systems that determine if a given text is safe or unsafe. This is crucial for ensuring that the safety evaluations are accurate. It even includes a new content moderator called WALLEDGUARD, which is significantly smaller than existing tools like LlamaGuard but achieves comparable performance.
Mutations: WALLEDEVAL introduces “mutators” to test how an LLM responds to different styles of prompts. For example, mutators can change the tense of a sentence or introduce misspellings, helping to uncover vulnerabilities that might be missed with standard prompts.

The paper highlights a new benchmark called SGXSTEST, specifically designed to assess exaggerated safety within the cultural context of Singapore. This benchmark includes prompts that are carefully phrased to challenge LLMs’ safety boundaries while considering the specific cultural sensitivities of the region.

WALLEDEVAL also offers support for “LLMs-as-a-Judge”—using LLMs themselves to evaluate the safety of other LLMs. This approach introduces new challenges, as the performance of the “judge” LLM needs to be carefully evaluated.

The paper showcases the effectiveness of WALLEDEVAL by conducting experiments on a wide range of LLMs, including open-weight models like Llamas and Mistrals and closed-weight models like ChatGPT and Claude. The results demonstrate that WALLEDEVAL is a valuable tool for understanding and improving the safety of LLMs. For example, the experiments reveal that some models are prone to excessive safety measures, while others exhibit vulnerabilities when presented with prompts in different languages or styles.

The researchers emphasize that WALLEDEVAL is a powerful tool for promoting the safe development and deployment of LLMs. By providing a comprehensive framework for evaluating LLM safety, the toolkit can help researchers, developers, and policymakers identify and address potential risks, ultimately contributing to the creation of more reliable and trustworthy AI systems.

AI Papers Reader

Personalized digests of latest AI research

A Comprehensive Toolkit for Safety Evaluation of Large Language Models

Chat about this paper