
Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Linear attention models, which offer efficient parallel training and linear-time recurrent inference, are becoming increasingly popular for large language models (LLMs). However, these models still struggle with tasks that require long-term memory, such as those involving in-context retrieval.

Recent research has shown that “finetuning pretrained Transformers to RNNs” (T2R) is a promising way to sidestep the high cost of training linear models from scratch. However, linear attention methods are often limited by their bounded memory capacity and struggle on recall-intensive tasks, which matters especially in T2R scenarios, where the finetuned model is expected to preserve the retrieval abilities of the original Transformer.
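To make the T2R idea concrete, here is a minimal, hypothetical PyTorch sketch: the pretrained attention block's projection layers are reused, the softmax read-out is swapped for a simple causal linear-attention one, and the resulting model would then be finetuned. The module and attribute names, the single-head simplification, and the elu-based feature map are illustrative assumptions, not the paper's recipe (which finetunes into GSA itself).

```python
import torch
import torch.nn as nn


class LinearAttnMixer(nn.Module):
    """Drop-in token mixer that reuses a pretrained block's q/k/v/o projections
    but replaces softmax attention with a causal linear-attention read-out."""

    def __init__(self, pretrained_attn: nn.Module):
        super().__init__()
        # Reuse the pretrained projections so finetuning starts close to the
        # original model (attribute names are assumed, not from the paper).
        self.q_proj = pretrained_attn.q_proj
        self.k_proj = pretrained_attn.k_proj
        self.v_proj = pretrained_attn.v_proj
        self.o_proj = pretrained_attn.o_proj

    def forward(self, x):  # x: (batch, seq, dim), single head for simplicity
        phi = lambda t: torch.nn.functional.elu(t) + 1  # positive feature map
        q, k, v = phi(self.q_proj(x)), phi(self.k_proj(x)), self.v_proj(x)
        # Causal linear attention via prefix sums over the sequence dimension.
        kv = torch.cumsum(k.unsqueeze(-1) * v.unsqueeze(-2), dim=1)  # (B, T, d, d)
        z = torch.cumsum(k, dim=1)                                   # (B, T, d)
        num = torch.einsum("btd,btde->bte", q, kv)
        den = (q * z).sum(-1, keepdim=True).clamp(min=1e-6)
        return self.o_proj(num / den)


class _DummySelfAttn(nn.Module):
    """Stand-in for a pretrained attention block, just to run the example."""

    def __init__(self, dim):
        super().__init__()
        self.q_proj, self.k_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_proj, self.o_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)


mixer = LinearAttnMixer(_DummySelfAttn(32))
y = mixer(torch.randn(2, 16, 32))  # (batch, seq, dim); finetuning would follow
```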

A new paper, “Gated Slot Attention for Efficient Linear-Time Sequence Modeling,” introduces Gated Slot Attention (GSA), a linear attention model that addresses these limitations. GSA stores context in a bounded set of memory slots and uses a gating mechanism to decide, token by token, what to retain and what to forget, yielding a more efficient and robust model.
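As a rough illustration of what such a gated, bounded memory can look like at inference time, here is a minimal recurrent sketch assuming m memory slots, per-slot forget gates alpha_t in (0, 1), and a softmax read-out over the slots. The shapes, gate parameterization, and initialization are assumptions for illustration, not the paper's exact layer.

```python
import torch


def gsa_recurrent_step(q_t, k_t, v_t, alpha_t, K_mem, V_mem):
    """One recurrent (inference-time) step with m memory slots.

    q_t, k_t: (d_k,)   query/key for the current token
    v_t:      (d_v,)   value for the current token
    alpha_t:  (m,)     per-slot forget gate in (0, 1)
    K_mem:    (m, d_k) slot key memory
    V_mem:    (m, d_v) slot value memory
    """
    # Gated write: each slot decays by alpha and absorbs the new key/value
    # weighted by (1 - alpha), so the state stays bounded at m slots.
    K_mem = alpha_t[:, None] * K_mem + (1 - alpha_t)[:, None] * k_t[None, :]
    V_mem = alpha_t[:, None] * V_mem + (1 - alpha_t)[:, None] * v_t[None, :]

    # Read: softmax over the m slots (not over the sequence), then mix values.
    slot_scores = torch.softmax(K_mem @ q_t, dim=0)  # (m,)
    o_t = V_mem.T @ slot_scores                      # (d_v,)
    return o_t, K_mem, V_mem


# Toy usage: the state size is constant regardless of sequence length.
d_k, d_v, m, T = 64, 64, 16, 128
K_mem, V_mem = torch.zeros(m, d_k), torch.zeros(m, d_v)
for _ in range(T):
    q_t, k_t, v_t = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha_t = torch.sigmoid(torch.randn(m))  # stands in for a data-dependent gate
    o_t, K_mem, V_mem = gsa_recurrent_step(q_t, k_t, v_t, alpha_t, K_mem, V_mem)
```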

The paper compares GSA with several other linear attention models, including RetNet, GLA, and HGRN2, across language modeling, commonsense reasoning, and in-context retrieval benchmarks. GSA matches or outperforms these baselines while remaining efficient to train and run, with its clearest advantage on recall-intensive tasks.

For instance, at the 1.3B-parameter scale, GSA reached a perplexity of 12.6 on WikiText, comparable to HGRN2's 11.8 at the same size. Furthermore, in the T2R setting, finetuning Mistral-7B into GSA gave better performance on downstream tasks such as question answering than finetuning into other linear attention models.

GSA’s gating mechanism is also particularly beneficial for in-context retrieval, where models need to remember information from previous parts of the sequence. In a synthetic task designed to evaluate in-context recall, GSA achieved significantly higher accuracy than other linear attention models.
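For intuition, a synthetic recall probe can be as simple as the following sketch: the model sees a list of key-value pairs followed by a query key and must produce the value bound to that key earlier in the context. This is an illustrative, hypothetical generator in the spirit of such benchmarks; the paper's actual evaluation setup may differ.

```python
import random


def make_recall_example(num_pairs=8, vocab=tuple(range(100, 200)), seed=None):
    """Emit a key-value context followed by a query key; the target is the
    value that was paired with that key earlier in the sequence."""
    rng = random.Random(seed)
    keys = rng.sample(vocab, num_pairs)
    values = [rng.choice(vocab) for _ in keys]
    query = rng.choice(keys)
    target = values[keys.index(query)]
    # Input: k1 v1 k2 v2 ... kN vN query  ->  Target: value paired with query
    tokens = [tok for kv in zip(keys, values) for tok in kv] + [query]
    return tokens, target


tokens, target = make_recall_example(seed=0)
```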

The paper concludes that GSA is a promising new approach to linear attention, offering both strong performance and efficiency. The authors see GSA as a valuable option for training and deploying LLMs that need greater memory capacity and strong performance on recall-intensive tasks.

The authors also note that this work is an early step with room for improvement. They plan to study GSA at larger scales and in more challenging settings, such as tasks with long-range dependencies, and to explore combining GSA with other techniques, such as hybrid models, to further strengthen its capabilities.

This research is important for the development of LLMs because it provides a new approach to linear attention that addresses many of its limitations. By leveraging the strengths of both softmax attention and linear attention, GSA has the potential to enable the training of more efficient and powerful LLMs that can handle a wider range of tasks.