November 26th, 2024

What happens if we remove 50 percent of Llama?

Neural Magic has launched Sparse Llama 3.1, a sparse model derived from Meta's Llama 3.1 that achieves 98% accuracy recovery with 50% fewer parameters. Optimized for NVIDIA GPUs, it delivers significantly higher throughput and lower latency.

Neural Magic has introduced Sparse Llama 3.1, a sparse foundation model derived from Meta's Llama 3.1 8B and designed for efficient GPU inference. The model employs a 2:4 sparsity pattern, which prunes 50% of the parameters while maintaining high accuracy, achieving 98% recovery on the Open LLM Leaderboard v1. Sparse Llama is optimized for NVIDIA Ampere GPUs, providing up to 30% higher throughput and 1.8x lower latency. It also supports advanced 4-bit quantization methods, enhancing inference speed by 1.4x to 4.9x depending on the hardware. The model's development builds on previous research, including SparseGPT and Sparse Llama 2, and reduces the environmental impact of further training by curating a high-quality pretraining dataset of just 13 billion tokens. Performance evaluations show Sparse Llama excelling in few-shot benchmarks and fine-tuning tasks across various domains, including mathematics and coding, often outperforming dense models. The model's architecture allows for significant inference speedups, with results indicating a 5.0x speedup on A5000 GPUs. Sparse Llama aims to make advanced AI more accessible and efficient, encouraging community engagement through open-source resources.
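
For context, a 2:4 sparsity pattern means that in every aligned group of four weights at least two are zero, which is the structured format NVIDIA's sparse tensor cores can accelerate. Below is a minimal illustrative sketch of magnitude-based 2:4 pruning in PyTorch; it is not Neural Magic's method, which builds on SparseGPT and chooses which weights to drop far more carefully:

    import torch

    def prune_2_4(weight: torch.Tensor) -> torch.Tensor:
        """Zero the 2 smallest-magnitude weights in every aligned group of 4."""
        out_features, in_features = weight.shape
        assert in_features % 4 == 0, "columns must be divisible by 4"
        groups = weight.reshape(out_features, in_features // 4, 4)
        # Indices of the 2 largest-magnitude weights in each group of 4.
        keep = groups.abs().topk(k=2, dim=-1).indices
        mask = torch.zeros_like(groups, dtype=torch.bool)
        mask.scatter_(-1, keep, True)
        # Everything outside the top-2 per group becomes zero: exactly 50% sparsity.
        return (groups * mask).reshape(out_features, in_features)

    w = torch.randn(8, 16)
    w_sparse = prune_2_4(w)
    print((w_sparse == 0).float().mean())  # ~0.5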

- Sparse Llama 3.1 achieves 98% accuracy recovery while reducing model size by 50%.

- The model is optimized for NVIDIA GPUs, offering significant improvements in throughput and latency.

- It supports advanced quantization techniques, enhancing inference speed across various hardware.

- Sparse Llama demonstrates strong performance in fine-tuning tasks, often surpassing dense model counterparts.

- The initiative aims to promote efficient AI deployment and community collaboration through open-source access.

Related

Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU

The article discusses the release of the open-source Llama 3 70B model, highlighting its performance compared to GPT-4 and Claude 3 Opus. It emphasizes training enhancements, data quality, and the competition between open- and closed-source models.

Llama 3.1 Official Launch

Meta introduces Llama 3.1, an open-source AI model family available in 8B, 70B, and 405B versions. The 405B model is highlighted for its versatility in supporting various use cases, including multilingual agents and analyzing large documents. Users can leverage coding assistants, real-time or batch inference, and fine-tuning capabilities. Meta emphasizes open-source AI and offers subscribers updates via a newsletter.

Llama 3.1: Our most capable models to date

Meta has launched Llama 3.1 405B, an advanced open-source AI model supporting diverse languages and extended context length. It introduces new features like Llama Guard 3 and aims to enhance AI applications with improved models and partnerships.

Llama 3 Secrets Every Engineer Must Know

Llama 3 is an advanced open-source language model trained on 15 trillion multilingual tokens, featuring 405 billion parameters, improved reasoning, and multilingual capabilities; the article explores practical applications and limitations.

Quantized Llama models with increased speed and a reduced memory footprint

Meta has launched lightweight quantized Llama models for mobile devices, achieving 2-4x speedup and 56% size reduction. These models support short-context applications and prioritize on-device data processing for privacy.

18 comments
By @celltalk - 4 months
All of these smaller-model results suggest that we need to incorporate pruning into model training. NEAT was one of my favorite algorithms of all time. Same with BitNet models, which keep showing that the information neural networks actually need is not that much. And again, it is the same with us: we use much less energy than a regular network, so there seems to be an immense waste of energy in training these models.

My intuition tells me the pre-training paradigm will shift immensely in the near future, because we have started to understand that we don't need all these parameters: the subnetworks seem to be very robust at preserving information in high dimensions. We keep saying curse of dimensionality, but it is more like the bliss of dimensionality we keep seeing. Network redundancy still seems to be very high, given that BitNet is more or less comparable to other LLMs.

This basically shows that over 50% of the neural net is gibberish! The reason is that the objective function simply does not account for it.

Again, my intuition tells me that neural scaling laws are incomplete as they stand, because they lack an efficiency parameter that needs to be taken into account (or that was simply left out due to corporate greed).

And this is what we are seeing as “the wall”.

I am no expert in neural network theory or in math, but I would assume the laws should be something in the vicinity of this formulation/simulation:

https://colab.research.google.com/drive/1xkTMU2v1I-EHFAjoS86...

and encapsulate Shannon's channel capacity. I call them generalized scaling laws, since they include what they should have included in the first place: entropy.
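
For reference, and not a reconstruction of the linked Colab notebook: the Chinchilla-style scaling law this comment argues is incomplete, and the Shannon channel capacity it wants folded in, are usually written as

    % Chinchilla-style compute-optimal scaling law (Hoffmann et al., 2022):
    % N = parameter count, D = training tokens, E = irreducible loss.
    L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

    % Shannon channel capacity for bandwidth W and signal-to-noise ratio S/N:
    C = W \log_2\!\left(1 + \frac{S}{N}\right)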

By @fxj - 4 months
After reading the article it seems to me that this is more like synaptic pruning where weak connections between neurons are eliminated in order to increase the efficiency of the neurons. Interesting to see that this also works for LLMs.

https://en.wikipedia.org/wiki/Synaptic_pruning

By @agroot12 - 4 months
I might be missing something, but it would be great if the charts showed inference speed, model size (required VRAM), and quality (benchmark results) together. It might be that the same quality, speed, and size can be attained by just quantizing, perhaps with added fine-tuning, without the sparseness. The post seems to imply that their method is better, but if that's the case, they could show it.
By @devsda - 4 months
I don't understand LLMs enough to know if this is a silly question or not.

Is it possible to build domain-specific smaller models and merge/combine them at query/run time to give better responses or performance, instead of one large all-knowing model that learns everything?

By @jbverschoor - 4 months
LLobotoMy
By @slaucon - 4 months
> “By sourcing and filtering only the highest-quality and most representative data for LLM use cases, we reduced the pretraining set to just 13 billion tokens—drastically cutting the environmental impact of further training while preserving performance.”

Would love to know more about how they filtered the training set down here and what heuristics were involved.

I think that the models we use now are enormous for the use cases we’re using them for. Work like this and model distillation in general is fantastic and sorely needed, both to broaden price accessibility and to decrease resource usage.

I’m sure frontier models will only get bigger, but I’d be shocked if we keep using the largest models in production for almost any use case.
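
The post doesn't spell out those heuristics. As a purely hypothetical illustration of the general score-threshold approach to data curation, and not Neural Magic's actual pipeline, a sketch might look like this (the scoring function, threshold, and budget handling are invented for illustration; only the 13-billion-token figure comes from the post):

    def quality_score(doc: str) -> float:
        """Toy stand-in for a learned quality classifier or heuristic score."""
        words = doc.split()
        if not words:
            return 0.0
        unique_ratio = len(set(words)) / len(words)        # penalize boilerplate/repetition
        length_ok = 1.0 if 50 <= len(words) <= 5000 else 0.5
        return unique_ratio * length_ok

    def curate(docs, threshold=0.6, token_budget=13_000_000_000):
        """Keep the highest-scoring documents until the token budget is reached."""
        kept, tokens = [], 0
        for doc in sorted(docs, key=quality_score, reverse=True):
            if quality_score(doc) < threshold or tokens >= token_budget:
                break
            kept.append(doc)
            tokens += len(doc.split())                     # crude whitespace token count
        return kept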

By @david-gpu - 4 months
For those curious, NVidia and Cerebras have been doing R&D in sparse neural nets for something like a decade. NVidia began adding hardware support for them several generations ago (Ampere).

It is significantly more complex than it appears at first sight.

By @ssalka - 5 months
Surprising that the retained accuracy is so high after removing half of the parameters. Does this help with being able to run inference on low-end GPUs?
By @zug_zug - 4 months
Curious if anybody can explain what a 2:4 sparsity pattern is. Are the 2 to be removed picked randomly?
By @sorenjan - 4 months
Is it possible to rearrange a sparse matrix into a smaller dense matrix? Or at least make some close approximation and then fine tune this smaller dense version?
By @drdaeman - 4 months
I'm curious - what happens if one prunes the halved model again (if that's possible with the same method), would it start losing accuracy?
By @andycowley - 4 months
It would fall over
By @reify - 4 months
Two legs, half a head, and enough wool to make a small knitted jumper
By @MrGuts - 5 months
You do know that AIs are reading this stuff, right?

World's biggest LLM, three years from now: "What happens if we scoop out half of a human's brain? Probably not anything significant."

By @v3ss0n - 4 months
A 2% drop is really big. Even q4 and q6 quants drop accuracy in long-context understanding and complex questions, yet those claim less than a 1% drop in benchmarks. This would give the LLM functioning autism.