Q-Sparse: All Large Language Models Can Be Fully Sparsely-Activated
The paper introduces Q-Sparse, a method for training sparsely-activated large language models that achieves full sparsity of activations and significant efficiency gains at inference. Q-Sparse is effective across a range of LLM settings, including full-precision and 1-bit models like BitNet b1.58, promising enhanced efficiency and reduced costs.
The paper titled "Q-Sparse: All Large Language Models Can Be Fully Sparsely-Activated" introduces a method called Q-Sparse for training sparsely-activated large language models (LLMs). The approach enables full sparsity of activations in LLMs, leading to significant efficiency gains during inference. By applying top-K sparsification to the activations and the straight-through estimator during training, Q-Sparse achieves results comparable to baseline LLMs while being much more efficient at inference time. The study also presents an inference-optimal scaling law for sparsely-activated LLMs and demonstrates the effectiveness of Q-Sparse in various settings, including training from scratch, continue-training of off-the-shelf LLMs, and finetuning. Moreover, Q-Sparse is shown to work for both full-precision and 1-bit LLMs, such as BitNet b1.58, and, when combined with Mixture-of-Experts (MoE), could substantially improve the efficiency, cost, and energy consumption of future LLMs.
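To make the core idea concrete, here is a minimal PyTorch sketch of top-K activation sparsification with a straight-through estimator. The function name `topk_sparsify_ste` and details such as per-row sparsification and the absence of any rescaling are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def topk_sparsify_ste(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest-magnitude entries of each activation row.

    The hard top-K mask is non-differentiable, so the straight-through
    estimator passes gradients through as if the operation were the identity.
    (Illustrative sketch; not the paper's exact formulation.)
    """
    # Build a binary mask selecting the top-k entries by absolute value.
    _, idx = torch.topk(x.abs(), k, dim=-1)
    mask = torch.zeros_like(x).scatter_(-1, idx, 1.0)
    y = x * mask
    # Forward pass uses the sparse y; backward pass treats the op as identity,
    # so gradients flow to all entries of x, including the pruned ones.
    return x + (y - x).detach()


if __name__ == "__main__":
    x = torch.randn(2, 8, requires_grad=True)
    y = topk_sparsify_ste(x, k=3)   # only 3 of 8 activations survive per row
    y.sum().backward()
    print(y)
    print(x.grad)                   # all ones: gradients pass straight through
```

In a transformer, a function like this would be applied to the activations entering each linear projection, so that only the selected entries participate in the matrix multiplication at inference time.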