New AI Metrics Reveal Progress in Language Model Interpretability
đź“„ Full Paper
đź’¬ Ask
Artificial intelligence researchers are working to understand how language models work. One approach is to decipher the latent features that language models learn, that is, the internal representations a model builds in order to process language.
One recent approach uses sparse autoencoders (SAEs) to extract interpretable features from language models. An SAE is a neural network trained to decompose a model's internal activations into a sparse combination of features, each of which ideally corresponds to a human-interpretable concept. However, evaluating the quality of these SAEs is difficult because there is no ground truth to compare their learned features against.
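To make this concrete, here is a minimal sketch of a sparse autoencoder of the kind used in interpretability work. The class name, dimensions, and loss form are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder: reconstructs model activations through
    an overcomplete, sparsely activating hidden layer of 'features'."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # The feature dictionary is typically much wider than the model's
        # hidden dimension (overcomplete), e.g. d_features = 8 * d_model.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # Feature activations: non-negative and, ideally, mostly zero.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

# Training typically minimizes reconstruction error plus a sparsity penalty:
#   loss = ||x - x_hat||^2 + lambda * ||features||_1
```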
Now, researchers from the University of Massachusetts Amherst have made progress in evaluating SAEs by training them on language models that were themselves trained on chess and Othello game transcripts. This work, published on the preprint server arXiv, makes two key contributions:
1. New metrics for evaluating SAEs. The researchers defined a set of interpretable board-state properties for chess and Othello and used them to build two metrics of SAE quality (a sketch of how one of them could be computed follows this list):
- Board Reconstruction: Can the SAE reconstruct the game board state by interpreting its features as classifiers for different board configurations?
- Coverage: How many of the pre-defined interpretable features does the SAE actually capture?
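Here is a hedged sketch of how coverage could be computed by treating each SAE feature as a binary classifier for a predefined board-state property. The function names, the activation threshold, and the use of an F1 score are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def feature_f1(feature_acts: np.ndarray, property_labels: np.ndarray,
               threshold: float = 0.0) -> float:
    """Score one SAE feature as a classifier for one board-state property.

    feature_acts:    (n_positions,) activations of a single feature
    property_labels: (n_positions,) 1 if the property holds (e.g. "knight on f3")
    """
    predictions = feature_acts > threshold
    tp = np.sum(predictions & (property_labels == 1))
    precision = tp / max(predictions.sum(), 1)
    recall = tp / max(property_labels.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def coverage(all_feature_acts: np.ndarray, all_property_labels: np.ndarray) -> float:
    """Coverage: for each predefined property, take the best-scoring feature,
    then average those best scores across all properties.

    all_feature_acts:    (n_positions, n_features)
    all_property_labels: (n_positions, n_properties)
    """
    best_scores = []
    for p in range(all_property_labels.shape[1]):
        scores = [feature_f1(all_feature_acts[:, f], all_property_labels[:, p])
                  for f in range(all_feature_acts.shape[1])]
        best_scores.append(max(scores))
    return float(np.mean(best_scores))
```

Board reconstruction works in the other direction: the features active at a given position are used to predict the full board state, and the metric measures how much of the board is correctly recovered.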
These new metrics are a significant step forward because they provide a ground-truth-based way to evaluate SAEs, rather than relying on indirect proxies such as sparsity or reconstruction fidelity.
2. A new training technique for SAEs: p-annealing. Traditional SAEs are trained with a sparsity penalty that encourages most feature activations to be zero. The researchers found that p-annealing, which gradually changes the sparsity penalty during training, produced SAEs whose performance was on par with other, more computationally expensive methods.
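The general idea can be sketched as follows: the exponent p of an Lp-style sparsity penalty is lowered over the course of training, so the objective starts out as a standard L1 penalty and gradually approaches a harder, more L0-like one. The schedule, the final value of p, and the loss weighting shown here are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def lp_sparsity_penalty(features: torch.Tensor, p: float, eps: float = 1e-8) -> torch.Tensor:
    """Sum of |f|^p over feature activations, averaged over the batch."""
    return (features.abs() + eps).pow(p).sum(dim=-1).mean()

def annealed_p(step: int, total_steps: int, p_start: float = 1.0, p_end: float = 0.2) -> float:
    """Linearly anneal the exponent p from p_start down to p_end."""
    frac = min(step / total_steps, 1.0)
    return p_start + frac * (p_end - p_start)

# Inside a training loop (sae, optimizer, loader, total_steps, sparsity_coeff
# are assumed to exist):
#
# for step, activations in enumerate(loader):
#     recon, features = sae(activations)
#     p = annealed_p(step, total_steps)
#     loss = (recon - activations).pow(2).sum(dim=-1).mean() \
#            + sparsity_coeff * lp_sparsity_penalty(features, p)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```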
To illustrate how the new metrics work, consider a board-state property such as "there is a knight on f3." A good SAE would contain a feature that reliably identifies this property with high precision: it should activate when there is a knight on f3, and not in other situations.
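As a hedged illustration of where such labels could come from, the snippet below derives a "knight on f3" label for every position in a game transcript using the python-chess library; the paper's own labeling pipeline may differ.

```python
import chess

def knight_on_f3_labels(san_moves: list[str]) -> list[int]:
    """Return a 0/1 label after each move: is there a knight on f3?"""
    board = chess.Board()
    labels = []
    for move in san_moves:
        board.push_san(move)
        piece = board.piece_at(chess.F3)
        labels.append(int(piece is not None and piece.piece_type == chess.KNIGHT))
    return labels

# Example: after 1. Nf3 the label is 1; after 1...d5 it stays 1.
print(knight_on_f3_labels(["Nf3", "d5"]))  # [1, 1]
```

Pairing labels like these with an SAE's feature activations is all that is needed to compute the precision of a candidate feature, and, across many properties, the coverage metric described above.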
The researchers’ work is a meaningful step forward for language model interpretability. The new metrics and training technique provide a more rigorous way to evaluate SAEs and should help drive the development of more powerful and interpretable language models.