Language Models as Causal Systems: Inferring Counterfactuals to Understand How Interventions Work

Language models (LMs) are increasingly used for tasks that require understanding and reasoning about the world. To ensure that these models behave as we intend, we need to understand their inner workings. This paper proposes a novel framework for understanding and manipulating the causal generation mechanisms in language models.

While previous work has primarily focused on intervening on LMs to alter their behavior, the authors argue that intervening alone is insufficient for understanding how an intervention affects a model's output: sampling fresh text from the modified model shows its new behavior, but not how any particular generation would have changed. Counterfactuals, scenarios in which we ask how the model would have generated a specific observed sentence under the intervention, provide a more insightful basis for evaluating interventions.
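
To make the distinction concrete, here is a minimal sketch (not taken from the paper) using a toy structural model with a single exogenous noise variable: the interventional question resamples the noise, while the counterfactual question reuses the noise that produced the observed outcome.

```python
import random

def original_model(noise):
    """Toy 'mechanism': maps exogenous noise to an output symbol."""
    return "A" if noise < 0.7 else "B"

def edited_model(noise):
    """The same mechanism after a hypothetical intervention."""
    return "A" if noise < 0.2 else "B"

noise = random.random()
observed = original_model(noise)                 # the outcome we actually saw

interventional = edited_model(random.random())   # fresh noise: a new sample from the edited model
counterfactual = edited_model(noise)             # same noise: what *this* outcome would have been

print(observed, interventional, counterfactual)
```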

To generate true counterfactual strings, the paper reformulates LMs as Generalized Structural Equation Models (GSEMs), a statistical framework that captures the joint distribution over original strings and their counterfactual counterparts. The authors leverage the Gumbel-max trick to separate the deterministic computation of next-symbol logits from the sampling step, which makes it possible to model the latent noise variables and generate counterfactuals of observed strings.
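
The sketch below illustrates this decomposition for a single generation step. It assumes the standard top-down ("hindsight") Gumbel sampling construction for inferring noise consistent with an observed token; the toy logits and variable names are illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_gumbel(loc, upper_bound):
    """Sample a Gumbel(loc) variable constrained to lie below upper_bound."""
    g = rng.gumbel(loc)
    return -np.logaddexp(-upper_bound, -g)

def posterior_gumbel_noise(logits, observed_token):
    """Sample per-symbol noise u such that argmax(logits + u) equals the observed symbol."""
    max_val = rng.gumbel(np.logaddexp.reduce(logits))  # maximum of all perturbed logits
    perturbed = np.array([
        max_val if i == observed_token else truncated_gumbel(logits[i], max_val)
        for i in range(len(logits))
    ])
    return perturbed - logits  # the latent Gumbel noise, one value per vocabulary item

# Toy next-symbol logits for one step (vocabulary of four symbols).
logits_original = np.array([2.0, 0.5, -1.0, 0.0])   # model before the intervention
logits_edited   = np.array([2.0, 0.5,  1.5, 0.0])   # model after a hypothetical edit

observed = 0                                            # symbol the original model produced
noise = posterior_gumbel_noise(logits_original, observed)

assert np.argmax(logits_original + noise) == observed   # noise explains the observed symbol
counterfactual = np.argmax(logits_edited + noise)       # same noise, edited logits
print(counterfactual)
```

For a full string, the same coupling would be applied at every generation step, reusing each step's inferred noise when re-generating under the intervened model.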

They then showcase the utility of their framework by applying it to several commonly used intervention techniques, including knowledge editing, linear steering, and instruction tuning. They demonstrate that even interventions designed to be “minimal” can have considerable unintended side effects. For example, they find that an intervention aimed at changing a model’s knowledge about the location of the Louvre can unexpectedly alter text completions that have nothing to do with location.
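
One simple way to quantify such side effects is to compare each observed completion with its counterfactual on prompts the intervention should not affect and measure how often they diverge. The sketch below is a minimal illustration of that idea; the metric and the made-up token ids are assumptions, not the paper's evaluation.

```python
def divergence(observed_tokens, counterfactual_tokens):
    """Fraction of positions (up to the shorter length) where the two completions differ."""
    n = min(len(observed_tokens), len(counterfactual_tokens))
    if n == 0:
        return 0.0
    mismatches = sum(o != c for o, c in zip(observed_tokens, counterfactual_tokens))
    return mismatches / n

# Completions of a prompt unrelated to the Louvre edit (made-up token ids).
observed       = [17, 42, 42, 7, 99, 3]
counterfactual = [17, 42, 55, 7, 12, 3]
print(divergence(observed, counterfactual))  # ~0.33: the edit leaked into unrelated text
```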

The authors also provide evidence that commonly used intervention techniques often fail to achieve their intended effects in a targeted way. They highlight, for example, that gender-based interventions aimed at generating more feminine text can also influence completions that are unrelated to gender.

This work presents a novel approach for generating true counterfactuals from language models, opening up new possibilities for understanding and manipulating their causal generation mechanisms. The authors show that this approach provides a more precise and insightful tool for evaluating interventions and understanding how they affect model behavior. Their findings suggest that caution is needed when using interventions to modify model behavior, as they may have unintended consequences that are not immediately obvious.

This research represents a significant step toward understanding the causal properties of language models and provides a foundation for building more reliable and robust models for tasks that require understanding and reasoning about the world.