The Role of Anchor Tokens in Self-Attention Networks
The paper "Anchor-based Large Language Models" presents AnLLMs, which enhance efficiency by compressing sequence information, achieving up to 99% cache reduction and 3.5 times faster inference, while maintaining accuracy.
The paper "Anchor-based Large Language Models" introduces a novel approach to improving the efficiency of large language models (LLMs) through an anchor-based self-attention network (AnSAN) and an anchor-based inference strategy. Traditional LLMs, which primarily use decoder-only transformer architectures, face significant memory demands because they must retain keys and values for all historical tokens; this requirement grows with input length, motivating more efficient ways to store and process sequence information. The proposed Anchor-based LLMs (AnLLMs) compress sequence information into an anchor token, reducing the keys/values cache by up to 99% and accelerating inference by up to 3.5 times while maintaining comparable accuracy on question-answering benchmarks. Although accuracy is slightly compromised, the gains in memory usage and computational efficiency highlight the potential of AnLLMs for practical applications in natural language processing. The research has been accepted for presentation at ACL 2024.
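The practical payoff comes at inference time: once a processed span has been summarized by its anchor token, the keys/values of the other tokens in that span can be dropped from the cache. The snippet below is a minimal sketch of that pruning step, not the authors' implementation; the function name `prune_kv_cache_to_anchors` and the assumption that the anchor is the last token of each span are illustrative.

```python
# Minimal sketch (assumed, not the paper's code) of anchor-based KV-cache pruning.
import torch

def prune_kv_cache_to_anchors(past_keys, past_values, anchor_positions):
    """Keep only the cached keys/values at anchor positions, discarding the rest.

    past_keys, past_values: tensors of shape (batch, heads, seq_len, head_dim)
    anchor_positions: 1-D LongTensor of sequence indices to retain
    """
    keys = past_keys.index_select(dim=2, index=anchor_positions)
    values = past_values.index_select(dim=2, index=anchor_positions)
    return keys, values

# Toy example: a 100-token prefix whose information has been aggregated into a
# single anchor token at position 99. The pruned cache holds 1 entry instead of
# 100, which is where cache reductions on the order of 99% come from.
batch, heads, seq_len, head_dim = 1, 8, 100, 64
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)
anchors = torch.tensor([99])

k_pruned, v_pruned = prune_kv_cache_to_anchors(k, v, anchors)
print(k_pruned.shape)  # torch.Size([1, 8, 1, 64])
```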
- The study presents Anchor-based LLMs (AnLLMs) to improve efficiency in large language models.
- AnLLMs utilize an anchor-based self-attention network to compress sequence information (see the masking sketch after this list).
- The approach achieves up to 99% reduction in keys/values cache and 3.5 times faster inference.
- Accuracy remains close to that of conventional models, with only minor compromises.
- The research has been accepted at the ACL 2024 conference, indicating its relevance to the field.
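For a sense of how an anchor-based self-attention network can be trained to funnel a span's information into its anchor, the sketch below builds an attention mask in which each token attends causally within its own span and, across spans, only to the anchors of earlier spans. This is an illustrative construction consistent with the description above, not the paper's exact masking scheme; `anchor_attention_mask` and the choice of the last token as anchor are assumptions.

```python
# Assumed sketch of an anchor-style attention mask; not the authors' AnSAN code.
import torch

def anchor_attention_mask(chunk_lens):
    """Boolean mask of shape (total_len, total_len); True = may attend.

    Each chunk's last token is treated as its anchor. Tokens attend causally
    within their own chunk and, outside it, only to anchors of earlier chunks.
    """
    total = sum(chunk_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    anchors = []
    start = 0
    for length in chunk_lens:
        end = start + length
        # causal attention inside the current chunk
        mask[start:end, start:end] = torch.tril(
            torch.ones(length, length, dtype=torch.bool)
        )
        # attend to the anchor token of every earlier chunk
        for a in anchors:
            mask[start:end, a] = True
        anchors.append(end - 1)  # this chunk's anchor position
        start = end
    return mask

# Two chunks of lengths 3 and 2: tokens 3-4 can reach chunk 1 only via its
# anchor at position 2, plus causal attention among themselves.
print(anchor_attention_mask([3, 2]).int())
```

Because later spans can reach earlier spans only through their anchors, training under such a mask pushes the model to pack each span's information into that single anchor token.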