BiasLens: A New Way to Evaluate Bias in Large Language Models Without Labeled Data
Large Language Models (LLMs) are increasingly used in various applications, but they often inherit and amplify social biases from their training data. These biases can lead to unfair or unreliable outputs, especially when the model links unrelated concepts in a skewed manner. For example, if an LLM associates “doctor” more strongly with “male” than “female,” it reveals a gender bias.
Existing methods to evaluate these biases typically involve creating labeled datasets to measure model responses across different social groups. This process is labor-intensive and captures only a limited set of social concepts. Imagine needing to manually craft thousands of sentences like “The male doctor…” and “The female doctor…” just to assess gender bias in one specific profession. This approach clearly doesn’t scale well to the myriad of potential biases and concepts.
Now, researchers from MBZUAI and Huazhong University of Science and Technology have proposed a new framework called BiasLens that bypasses the need for manual test sets. This approach, detailed in a recent paper, uses the structure of the LLM’s internal representation space to detect biases.
BiasLens leverages two key techniques: Concept Activation Vectors (CAVs) and Sparse Autoencoders (SAEs). CAVs identify directions in the model’s activation space that represent specific concepts, such as “doctor” or “male.” SAEs then help interpret these directions by mapping them to a high-dimensional, sparse space of human-interpretable features.
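To make the CAV idea concrete, here is a minimal sketch of how a concept direction can be derived from a linear probe over activations. The helper name `derive_cav`, the probe type, and the random stand-in activations are illustrative assumptions, not the authors’ exact setup.

```python
# Illustrative sketch: derive a Concept Activation Vector (CAV) as the weight
# vector of a linear probe that separates concept-related from unrelated text.
# Layer choice, probe type, and data here are assumptions, not the paper's recipe.
import numpy as np
from sklearn.linear_model import LogisticRegression

def derive_cav(pos_activations: np.ndarray, neg_activations: np.ndarray) -> np.ndarray:
    """Train a linear classifier on hidden-state activations and return its
    unit-normalized weight vector as the concept direction."""
    X = np.vstack([pos_activations, neg_activations])
    y = np.concatenate([np.ones(len(pos_activations)), np.zeros(len(neg_activations))])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    cav = probe.coef_[0]
    return cav / np.linalg.norm(cav)

# Toy usage with random stand-ins for hidden states (hidden size 4096):
rng = np.random.default_rng(0)
doctor_acts = rng.normal(0.5, 1.0, size=(64, 4096))  # activations on "doctor" text
other_acts = rng.normal(0.0, 1.0, size=(64, 4096))   # activations on unrelated text
doctor_cav = derive_cav(doctor_acts, other_acts)
```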
Here’s how it works with the “doctor” and gender example:
- CAV Derivation: BiasLens trains linear classifiers to identify the “doctor”, “male”, and “female” concepts within the LLM’s activation space.
- Concept Representation Extraction: The framework extracts the LLM’s activations while it processes text related to “doctor,” then uses the CAV to steer those activations toward representations more strongly associated with “doctor.” A pre-trained sparse autoencoder projects both the original and the steered activations into a high-dimensional sparse space whose dimensions are designed to correspond to independently meaningful semantic attributes. The difference between the two resulting sparse vectors serves as the “doctor” concept representation (see the sketches after this list).
- Bias Score Calculation: The process is repeated for “male” and “female.” BiasLens then calculates the cosine similarity between the “doctor” vector and the “male” and “female” vectors. The difference between these similarities indicates the degree of bias. A large difference suggests the model associates “doctor” more strongly with one gender than the other.
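The extraction step can be sketched roughly as follows, assuming a simple additive steering rule (activation plus a scaled CAV) and an SAE encoder consisting of a linear layer with a ReLU; the paper’s actual steering procedure and SAE architecture may differ.

```python
# Sketch of concept-representation extraction; the steering rule and the SAE
# encoder below are simplified assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class SparseAutoencoderEncoder(nn.Module):
    """Encoder half of a (hypothetical) pretrained sparse autoencoder mapping
    d_model-dimensional activations into a wider, sparse feature space."""
    def __init__(self, d_model: int, d_sparse: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sparse)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.enc(h))  # ReLU keeps most features at zero

def concept_representation(acts: torch.Tensor, cav: torch.Tensor,
                           sae: SparseAutoencoderEncoder, alpha: float = 5.0) -> torch.Tensor:
    """Steer activations along the concept's CAV, encode both versions with the
    SAE, and return the difference of the mean sparse codes."""
    steered = acts + alpha * cav       # push activations toward the concept
    z_before = sae(acts).mean(dim=0)   # sparse features before steering
    z_after = sae(steered).mean(dim=0) # sparse features after steering
    return z_after - z_before          # the "concept representation"
```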
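Given concept representations for “doctor,” “male,” and “female,” the bias score then reduces to a gap between cosine similarities, along the lines of this sketch (the names simply follow the illustrative code above and are not the paper’s API):

```python
# Sketch of the bias score: cosine similarity of the "doctor" representation
# with the "male" vs. "female" representations, then their difference.
import torch
import torch.nn.functional as F

def bias_score(concept_vec: torch.Tensor, group_a_vec: torch.Tensor,
               group_b_vec: torch.Tensor) -> float:
    """Difference of cosine similarities; a larger magnitude suggests the
    concept is more strongly tied to one group than the other."""
    sim_a = F.cosine_similarity(concept_vec, group_a_vec, dim=0)
    sim_b = F.cosine_similarity(concept_vec, group_b_vec, dim=0)
    return (sim_a - sim_b).item()

# e.g. bias_score(doctor_rep, male_rep, female_rep) > 0 would suggest the model
# associates "doctor" more strongly with "male" than with "female".
```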
The researchers validated BiasLens on various LLMs, including Gemma 2 and Llama 3, and found strong agreement with traditional bias evaluation metrics (Spearman correlation r > 0.85). Moreover, BiasLens uncovered biases that existing methods struggle to detect. For instance, it revealed that an LLM in simulated clinical scenarios showed biased diagnostic assessments based on a patient’s insurance status – a subtle bias hard to capture with predefined test sets.
BiasLens offers a scalable, interpretable, and efficient way to discover biases in LLMs. This advancement paves the way for creating fairer and more transparent LLMs for widespread applications.