AI Papers Reader

Personalized digests of latest AI research


"Softpick" Offers a New Take on Transformer Attention, Reducing Sinks and Massive Activations

Researchers at MBZUAI have introduced a new attention mechanism called “softpick” that aims to address some limitations of the widely used softmax function in transformer models. The paper, titled “Softpick: No Attention Sink, No Massive Activations with Rectified Softmax,” claims that softpick can eliminate “attention sinks” and reduce the occurrence of “massive activations,” potentially leading to more efficient and interpretable models.

The softmax function is commonly used to normalize attention scores in transformers, turning them into probabilities that sum to one. However, this can lead to issues. One is the “attention sink” phenomenon, where certain tokens (often the first token in a sequence) receive disproportionately high attention scores, even if they are not semantically important. Imagine reading a paragraph and always focusing intensely on the very first word, even if it’s just “The.”
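To see why the sum-to-one constraint matters, consider normalizing some toy attention scores with softmax (the numbers below are invented for illustration, not taken from the paper): every key receives a strictly positive weight, so a head with nothing useful to attend to still has to place its attention mass somewhere, and it often lands on the first token.

```python
import torch

# Toy attention scores for one head: 4 query positions x 4 key positions.
scores = torch.tensor([[ 2.0, -1.0, -1.0, -1.0],
                       [ 1.5,  0.1, -0.5, -0.5],
                       [ 3.0, -2.0,  0.0, -1.0],
                       [ 2.5, -1.5, -1.5,  0.5]])

weights = torch.softmax(scores, dim=-1)
print(weights)              # every entry is strictly positive
print(weights.sum(dim=-1))  # each row sums to 1.0
print(weights[:, 0])        # column 0 (the "first token") soaks up most of the mass
```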

Another issue is “massive activations,” extremely large values in the hidden states of the transformer. These can cause problems for quantization (reducing the number of bits used to represent the model’s parameters), making it harder to create smaller, faster models. Think of it like having a volume knob that sometimes blasts the sound to 11, overwhelming the speakers and making it difficult to hear the nuanced parts of the music.
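To make the quantization concern concrete, here is a small sketch, not from the paper, of symmetric int8 “absmax” quantization: one huge activation forces a large quantization scale, so all of the ordinary values are rounded far more coarsely.

```python
import torch

def absmax_int8(x: torch.Tensor) -> torch.Tensor:
    # Symmetric int8 quantization: a single scale chosen from the largest magnitude.
    scale = x.abs().max() / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127)
    return q * scale  # dequantized values

hidden = torch.randn(8)     # ordinary activations, roughly in [-3, 3]
print((hidden - absmax_int8(hidden)).abs().max())  # small rounding error

hidden_with_outlier = hidden.clone()
hidden_with_outlier[0] = 300.0                     # one massive activation
err = (hidden_with_outlier - absmax_int8(hidden_with_outlier)).abs()
print(err[1:].max())  # the ordinary values are now quantized ~100x more coarsely
```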

Softpick aims to fix these issues with a modification of the softmax function. Instead of forcing all attention scores to sum to one, it allows rectified outputs: an attention head can assign exactly zero weight to a token, which promotes sparsity and reduces noise. The core of the softpick equation involves a Rectified Linear Unit (ReLU); essentially, if a score is low enough, the function outputs zero attention, producing “sparse” attention maps. By relaxing the strict sum-to-one behavior of softmax, softpick mitigates both attention sinks and massive activations.
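The authoritative equation is in the paper itself, but a rectified-softmax normalizer in this spirit can be sketched in PyTorch as follows; the max-subtraction for numerical stability and the small epsilon in the denominator are illustrative assumptions rather than details confirmed from the paper.

```python
import torch

def softpick(x: torch.Tensor, dim: int = -1, eps: float = 1e-6) -> torch.Tensor:
    """Rectified softmax-style normalizer sketch: scores at or below zero get
    exactly zero weight, and rows are not forced to sum to one."""
    m = x.max(dim=dim, keepdim=True).values  # subtract the max for stability,
    e = torch.exp(x - m)                     # just as in a stable softmax
    shifted = e - torch.exp(-m)              # rescaled version of exp(x) - 1
    num = torch.relu(shifted)                # ReLU: non-positive scores -> 0
    den = shifted.abs().sum(dim=dim, keepdim=True) + eps
    return num / den

scores = torch.tensor([[2.0, -1.0, -1.0, 0.5]])
print(softpick(scores))          # contains exact zeros
print(softpick(scores).sum(-1))  # close to, but not constrained to, 1.0
```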

The researchers tested softpick in 340-million-parameter transformer models trained from scratch. They report that softpick maintains performance parity with softmax on standard benchmarks. Critically, the models using softpick achieved a 0% sink rate and significantly lower kurtosis (a measure of the “tailedness” of the distribution of hidden states, indicating fewer extreme values), and their attention maps exhibited a sparsity of 46.97%. The researchers also found that models using softpick consistently outperform softmax when quantized, particularly at lower bit precisions, which suggests that softpick could be especially beneficial for deploying models on resource-constrained devices.
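The paper defines its own measurement protocol for these statistics; as a rough illustration, sparsity and kurtosis could be computed along the following lines (the helper names here are hypothetical, not from the paper’s code).

```python
import torch

def attention_sparsity(attn: torch.Tensor) -> float:
    # Fraction of exactly-zero entries in an attention map. Only rectified
    # mechanisms like softpick can produce true zeros; softmax weights are
    # always strictly positive.
    return (attn == 0).float().mean().item()

def excess_kurtosis(x: torch.Tensor) -> float:
    # Fourth standardized moment minus 3; large values indicate heavy tails,
    # i.e. the kind of outlier "massive activations" that hurt quantization.
    x = x.flatten().float()
    z = (x - x.mean()) / x.std()
    return (z.pow(4).mean() - 3.0).item()
```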

The attention maps produced by softpick are also described as “more legible,” since the mechanism allows functionally sparse and concentrated attention patterns. The authors believe that softpick could open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. The code for softpick is available on GitHub.

However, the authors acknowledge one potential issue: softpick can lead to underscaled scores in long contexts, potentially affecting retrieval performance. They are actively working to address this limitation.

This research offers a promising alternative to the standard softmax function in transformers, potentially paving the way for more efficient, interpretable, and easily quantized models.