
Unlocking the Black Box: New Method Decodes Language Model Representations

A novel technique called the “Hyperdimensional Probe” promises to shed light on the inner workings of large language models (LLMs), making them more understandable and debuggable.

Large language models have demonstrated remarkable capabilities, but their decision-making processes remain largely opaque. Existing methods for probing these models, such as Direct Logit Attribution (DLA) and Sparse Autoencoders (SAEs), offer only limited insight: DLA is constrained by the LLM’s output vocabulary, while SAEs produce features whose names are often vague or overly verbose.
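To make DLA’s vocabulary constraint concrete, here is a minimal sketch (toy sizes, not the paper’s code) of a DLA-style readout: the residual stream is projected through the model’s unembedding matrix, so any “explanation” can only be phrased in terms of output-vocabulary tokens.

```python
import numpy as np

# Minimal sketch of a DLA-style vocabulary readout (toy sizes, illustrative only).
d_model, vocab_size = 64, 1_000
rng = np.random.default_rng(0)

residual = rng.normal(size=d_model)           # residual-stream vector at some layer
W_U = rng.normal(size=(d_model, vocab_size))  # unembedding (vocabulary projection) matrix

logits = residual @ W_U                       # one score per output-vocabulary token
top_token_ids = np.argsort(logits)[-5:][::-1]
print(top_token_ids)  # the "explanation" is limited to whatever these tokens happen to mean
```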

This is where the Hyperdimensional Probe comes in. Developed by researchers at the University of Trento and Fondazione Bruno Kessler, this new paradigm integrates symbolic representations with neural probing to decipher information hidden within LLM vector spaces. It achieves this by mapping the LLM’s internal “residual stream” – a rich representation of processed information – into interpretable concepts using Vector Symbolic Architectures (VSAs).
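As a rough illustration of what a VSA encoding is (a generic binding-and-bundling sketch with bipolar hypervectors, not the paper’s exact construction), roles and fillers are random high-dimensional vectors that are bound by elementwise multiplication and superposed by addition, and the result can later be queried for its parts:

```python
import numpy as np

# Generic VSA sketch with bipolar hypervectors in {-1, +1}^D.
# The dimension and the role/filler symbols are illustrative assumptions.
D = 10_000
rng = np.random.default_rng(42)

def hv():
    """Draw a random bipolar hypervector."""
    return rng.choice([-1, 1], size=D)

# Atomic symbols: roles and fillers.
COUNTRY, CURRENCY = hv(), hv()
mexico, peso = hv(), hv()

# Bind each role to its filler (elementwise multiply), then bundle (add) into one vector.
encoding = COUNTRY * mexico + CURRENCY * peso

# Query: unbind with the CURRENCY role and look up the most similar known filler.
query = encoding * CURRENCY
fillers = {"mexico": mexico, "peso": peso}
best = max(fillers, key=lambda name: np.dot(query, fillers[name]))
print(best)  # -> "peso"
```

Binding with a bipolar vector is self-inverse, so unbinding with the CURRENCY role recovers the bound filler plus noise that is nearly orthogonal to every stored symbol; that is what makes such an encoding both compositional and decodable.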

Think of an LLM’s residual stream as a complex, high-dimensional painting. DLA might only tell you the colors present in the final brushstrokes, while SAEs might describe individual paint blobs but struggle to name them clearly. The Hyperdimensional Probe, however, aims to identify the distinct objects and scenes within that painting by translating the complex data into human-understandable symbols.

How it works in practice:

Imagine an LLM is asked to complete the analogy “Denmark is to krone as Mexico is to ?” The model processes the input, and its internal representations are fed into the Hyperdimensional Probe, which uses a trained neural network to convert them into VSA encodings (structured symbolic representations). These encodings are then decoded to extract specific concepts, such as “Mexico” and “peso.”
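A hedged sketch of that decoding (“clean-up”) step is below; the codebook, the decode_concepts helper, and the faked probe output are illustrative assumptions, since the trained probe network itself is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
D_HV = 10_000

# Codebook of known concept hypervectors (illustrative names; the paper's concept
# inventory and encoding scheme are not reproduced here).
codebook = {name: rng.choice([-1, 1], size=D_HV)
            for name in ["Mexico", "peso", "Denmark", "krone"]}

def decode_concepts(vsa_encoding, top_k=2):
    """Clean-up step: return the codebook symbols most similar to the probe output."""
    sims = {name: np.dot(vsa_encoding, hv) / (np.linalg.norm(vsa_encoding) * np.linalg.norm(hv))
            for name, hv in codebook.items()}
    return sorted(sims, key=sims.get, reverse=True)[:top_k]

# Stand-in for the trained probe: we fake its output as a noisy superposition of the
# concepts it would be expected to surface for "Denmark is to krone as Mexico is to ?".
probe_output = codebook["Mexico"] + codebook["peso"] + rng.normal(scale=0.5, size=D_HV)

print(decode_concepts(probe_output))  # -> ['Mexico', 'peso'] (order may vary)
```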

The researchers tested their Hyperdimensional Probe on various LLMs and input types, including syntactic pattern recognition, key-value associations, and abstract inference. They also evaluated it in a question-answering scenario. The results were promising: the probe consistently extracted meaningful concepts across different LLMs and embedding sizes.

Key findings and implications:

  • Enhanced Interpretability: The Hyperdimensional Probe provides a more interpretable way to understand LLM representations compared to existing methods.
  • Overcoming Limitations: It sidesteps the vocabulary constraints of DLA and the feature-naming challenges of SAEs.
  • Identifying LLM Failures: The probe can even help pinpoint where LLMs go wrong, offering insights into why they generate incorrect answers. For instance, in a question-answering task, the probe revealed that when an LLM failed, it often lost focus on the question’s core subject rather than lacking related knowledge.
  • Versatility: The approach is flexible and can be applied to various LLM architectures and tasks, including toxicity detection and bias classification.

This research marks a significant step towards demystifying LLMs, enabling a deeper understanding of their internal reasoning and paving the way for more reliable and controllable AI systems.