Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
Foundation models (FMs) are powerful machine learning systems trained on massive amounts of data, making them adept at various tasks like text generation, translation, and question answering. However, despite their vast capabilities, FMs are not always transparent. Understanding what concepts they capture and how they process information is crucial for ensuring their safety and reliability.
Sparse autoencoders (SAEs) have emerged as a promising tool for disentangling FM representations, offering a way to peek into their inner workings. However, SAEs struggle to capture rare concepts: ones that appear so infrequently in the training data that they are easily overlooked. These concepts, akin to dark matter, can be crucial for understanding and mitigating potential risks associated with FMs.
To address this issue, a research team from Carnegie Mellon University introduces Specialized Sparse Autoencoders (SSAEs), a method designed to extract rare features within a target subdomain. This targeted approach avoids scaling SAEs to billions of features, which would be computationally expensive and inefficient.
The researchers propose a practical recipe for training SSAEs, which involves three key steps:
- Subdomain Data Selection: Starting from a small seed dataset that exemplifies the rare concepts of interest, the researchers use dense retrieval, which matches documents by semantic similarity, to pull relevant examples from the FM's pretraining corpus (see the retrieval sketch after this list).
- Tilted Empirical Risk Minimization (TERM): TERM is a training objective that encourages more balanced learning of head and tail concepts by exponentially up-weighting high-loss examples, interpolating between the average loss and the worst-case loss rather than minimizing the average alone. This helps ensure that even rare concepts are well represented (see the tilted-loss sketch after this list).
- Finetuning the Sparse Autoencoder: With the relevant data selected, the final step finetunes a general-purpose SAE on the curated subdomain data, using the TERM objective during training. This lets the SAE specialize in features that are infrequent in the general domain but prevalent within the target subdomain (see the finetuning sketch after this list).
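As an illustration of the retrieval step, the sketch below embeds a handful of seed documents and a slice of candidate corpus text with an off-the-shelf sentence encoder, then ranks candidates by cosine similarity to the centroid of the seed embeddings. The encoder name, the `retrieve_subdomain` helper, and the centroid query are illustrative assumptions, not the authors' exact pipeline.

```python
# Illustrative sketch of dense retrieval for subdomain data selection.
# Model name and helper are assumptions, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_subdomain(seed_texts, corpus_texts, top_k=1000,
                       model_name="all-MiniLM-L6-v2"):
    """Rank corpus documents by cosine similarity to the seed-set centroid."""
    encoder = SentenceTransformer(model_name)
    seed_emb = encoder.encode(seed_texts, normalize_embeddings=True)
    corpus_emb = encoder.encode(corpus_texts, normalize_embeddings=True)

    # Average the unit-norm seed embeddings into a single query vector.
    query = seed_emb.mean(axis=0)
    query /= np.linalg.norm(query)

    # Cosine similarity reduces to a dot product for normalized vectors.
    scores = corpus_emb @ query
    top_idx = np.argsort(-scores)[:top_k]
    return [corpus_texts[i] for i in top_idx]
```

In practice, the seed set would be a few documents about the rare concept and the candidate pool a slice of the FM's pretraining corpus; the returned texts form the SSAE training set.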
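The tilted objective itself is compact. Below is a minimal PyTorch sketch, assuming per-example losses are already computed; the tilt hyperparameter `t` controls the trade-off (as `t` approaches 0 the objective reduces to the ordinary mean loss, while larger `t` pushes it toward the worst-case examples). The function name and the choice of `t` are illustrative.

```python
import math
import torch

def tilted_loss(per_example_losses: torch.Tensor, t: float = 1.0) -> torch.Tensor:
    """Tilted empirical risk: (1/t) * log( (1/N) * sum_i exp(t * loss_i) )."""
    n = per_example_losses.numel()
    # logsumexp keeps the exponentiation numerically stable.
    return (torch.logsumexp(t * per_example_losses, dim=0) - math.log(n)) / t
```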
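A minimal sketch of the finetuning step follows, assuming a standard ReLU sparse autoencoder with an L1 sparsity penalty applied to FM activations; the architecture, hyperparameters, and training loop are assumptions for illustration rather than the authors' exact configuration. The per-example losses it computes are what the tilted objective above is applied to.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Standard SAE: overcomplete ReLU encoder, linear decoder."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse feature activations
        x_hat = self.decoder(z)           # reconstruction of the FM activation
        return x_hat, z

def finetune_ssae(sae, activation_loader, steps=1000, l1_coef=1e-3, lr=1e-4, t=1.0):
    """Finetune a pretrained SAE on subdomain activations with a tilted loss."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for step, acts in zip(range(steps), activation_loader):
        x_hat, z = sae(acts)
        # Per-example loss: reconstruction error plus L1 sparsity penalty.
        per_example = ((x_hat - acts) ** 2).mean(dim=-1) + l1_coef * z.abs().sum(dim=-1)
        loss = tilted_loss(per_example, t=t)  # tilted objective from the sketch above
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae
```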
To demonstrate the effectiveness of SSAEs, the researchers conducted experiments on two datasets: arXiv Physics and Pile Toxicity. They found that SSAEs trained using dense retrieval and TERM significantly outperformed standard SAEs in capturing tail concepts. Furthermore, the researchers conducted a case study on the Bias in Bios dataset, where SSAEs successfully removed spurious gender information from a classifier, leading to a 12.5% increase in worst-group classification accuracy.
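Spurious-concept removal of this kind is commonly implemented by zeroing out the SAE latents associated with the unwanted concept and passing the reconstructed activations to the downstream classifier. The sketch below illustrates that general pattern using the autoencoder class above; the function name and the way spurious latents are identified are assumptions, not the paper's exact procedure.

```python
import torch

def ablate_features(activations: torch.Tensor, sae, spurious_idx) -> torch.Tensor:
    """Zero out selected SAE latents and return the reconstructed activations."""
    with torch.no_grad():
        z = torch.relu(sae.encoder(activations))
        z[:, spurious_idx] = 0.0      # remove latents tied to the spurious concept
        return sae.decoder(z)         # classifier sees the cleaned activations
```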
This research demonstrates the potential of SSAEs as a powerful new lens for peering into the inner workings of FMs within specific subdomains. By focusing on targeted subdomains and leveraging TERM, SSAEs can effectively capture rare features and provide valuable insights into FM behavior, contributing to their safety and reliability. The work also highlights the ethical considerations involved in developing interpretability methods for FMs, particularly around handling sensitive attributes and preventing potential misuse.