AI Papers Reader

Personalized digests of the latest AI research


Enhancing Automated Interpretability with Output-Centric Feature Descriptions: A More Accurate Understanding of Language Models

A new paper from researchers at Tel Aviv University and Pr(Ai)2R Group challenges the standard approach to interpreting large language models (LLMs). Current methods for understanding what an LLM “thinks” focus primarily on identifying the inputs that activate specific features (e.g., neurons or directions in the model’s activation space). The researchers argue this “input-centric” approach is insufficient and propose complementary “output-centric” methods that also capture how activating a feature affects the model’s output.

The Problem with Input-Centric Descriptions

Imagine a feature in an LLM associated with the concept of “competition.” An input-centric approach might identify phrases like “fierce competition” or “market rivalry” as activators. However, this misses a crucial piece of the puzzle: what does activating the feature actually *do*? Does it make the model more likely to generate text about business deals, sports games, or war? The causal effect on the output matters just as much as the inputs, yet current methods often fail to capture it, leading to inaccurate or incomplete descriptions. The paper also notes that relying solely on input examples can produce inconsistent descriptions (they depend on the dataset used to gather activating examples) or even cause features to be misclassified as “dead” (i.e., they appear never to activate).

Output-Centric Methods for Better Understanding

To address this limitation, the researchers propose two new output-centric methods:

  1. VocabProj: This method projects the feature vector onto the model’s vocabulary space. The tokens that receive the highest weights under this projection are considered the most “prominent” and are taken to reflect the feature’s impact on the output. For example, if the “competition” feature projects strongly onto words like “battle,” “conflict,” or “war,” its output effect is likely related to conflict. A minimal code sketch of this projection appears after the list.

  2. TokenChange: This method measures how much the probabilities of tokens in the model’s output distribution change when the feature is artificially amplified. Tokens whose probabilities increase significantly after amplification are taken to reflect the feature’s causal influence. A second sketch after the list illustrates this measurement.
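
To make VocabProj concrete, here is a minimal sketch of the projection idea, not the authors’ released implementation. The model (GPT-2), the random `feature_vec`, and the choice of k=20 are placeholder assumptions; in practice the feature direction would come from an SAE latent, a neuron, or another direction identified in the model’s residual stream.

```python
# Sketch of a VocabProj-style projection: score every vocabulary token by how
# strongly the feature direction aligns with its unembedding vector.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small model used purely for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

hidden_dim = model.config.hidden_size
# Hypothetical feature direction; a real one would come from an SAE latent,
# a neuron, or some other direction found in the residual stream.
feature_vec = torch.randn(hidden_dim)

# Project the feature through the unembedding matrix to get a score per token,
# then keep the top-k tokens as the feature's most "prominent" outputs.
unembed = model.get_output_embeddings().weight   # shape: (vocab_size, hidden_dim)
scores = unembed @ feature_vec                   # shape: (vocab_size,)
top = torch.topk(scores, k=20)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```

The resulting token list can then be handed to an explainer model (or a human) to phrase a description of what the feature promotes in the output.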
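
TokenChange can be sketched in a similarly hedged way: amplify the feature in one layer’s residual stream and observe which next-token probabilities rise. The layer index, amplification scale, prompt, and random feature direction below are illustrative assumptions rather than the paper’s exact setup.

```python
# Sketch of a TokenChange-style measurement: compare next-token probabilities
# with and without amplifying a feature direction inside one transformer block.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_idx = 6    # assumed layer hosting the feature
scale = 10.0     # assumed amplification strength
feature_vec = torch.randn(model.config.hidden_size)  # hypothetical feature
prompt = "The teams prepared for the"
inputs = tokenizer(prompt, return_tensors="pt")

def amplify_hook(module, hook_inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # add the scaled feature direction to every position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * feature_vec
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

with torch.no_grad():
    base_probs = torch.softmax(model(**inputs).logits[0, -1], dim=-1)
    handle = model.transformer.h[layer_idx].register_forward_hook(amplify_hook)
    amp_probs = torch.softmax(model(**inputs).logits[0, -1], dim=-1)
    handle.remove()

# Tokens whose probability rises the most under amplification are taken to
# reflect the feature's causal effect on the model's output.
delta = amp_probs - base_probs
top = torch.topk(delta, k=20)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```

A fuller version would average these probability shifts over many prompts rather than relying on a single one.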

Combining Input and Output for Optimal Results

The researchers find that the output-centric methods, besides being more computationally efficient, often outperform input-centric methods at capturing a feature’s causal effect on the output, though they are weaker at identifying which inputs activate it. Combining input-centric and output-centric descriptions, however, yields the best results on both input-based and output-based evaluations. This suggests that a holistic approach, describing both what activates a feature and what its activation does to the output, gives the most accurate picture.

Reviving “Dead” Features

A surprising finding is that the output-centric descriptions can be used to find inputs that activate features previously labeled “dead” (i.e., features that never appeared to activate). This suggests that the limitations of input-centric pipelines may be hiding features that are in fact meaningful, making the technique a useful tool for recovering features that would otherwise be discarded.

Implications for the Field

This paper offers a significant contribution to the field of automated interpretability. By emphasizing the importance of considering the causal effect on output, researchers can develop more accurate and reliable methods for understanding how LLMs represent concepts and generate text. The new output-centric methods offer a computationally efficient approach for expanding our understanding and facilitating more advanced applications of LLMs. The paper’s authors have open-sourced their code and data, inviting further exploration and improvement.