October 24th, 2024

Quantized Llama models with increased speed and a reduced memory footprint

Meta has launched lightweight quantized Llama models for mobile devices, achieving a 2-4x speedup and a 56% reduction in model size. These models target short-context applications (up to 8K tokens) and keep data processing on-device for privacy.

Meta has introduced lightweight quantized Llama models designed for enhanced performance on mobile devices. The quantized Llama 3.2 1B and 3B models achieve a 2-4x speedup, a 56% reduction in model size, and a 41% decrease in memory usage compared to the original BF16 format.

The quantization relies on two techniques: Quantization-Aware Training (QAT) with LoRA adaptors, which prioritizes accuracy, and SpinQuant, a post-training quantization method that emphasizes portability. The models are optimized for deployment on Qualcomm and MediaTek SoCs with Arm CPUs, making them suitable for on-device applications. They support short-context applications up to 8K tokens and have been tested on devices such as the OnePlus 12 (Android), showing significant improvements in decode and prefill latency.

Meta aims to make development with Llama easier by providing these models to the community, allowing developers to build efficient applications that preserve privacy by processing data on-device. The company is also exploring Neural Processing Units (NPUs) for further performance gains. The Llama 3.2 models are available for download on llama.com and Hugging Face, reflecting Meta's commitment to openness and innovation in AI.

- Meta has released quantized Llama models for mobile devices, improving speed and reducing memory usage.

- The models achieve a 2-4x speedup and a 56% reduction in size compared to the original format.

- Two quantization techniques were used: QAT with LoRA for accuracy and SpinQuant for portability.

- The models are optimized for Qualcomm and MediaTek SoCs, supporting applications with short contexts.

- Llama 3.2 models are available for download, promoting community development and privacy-focused applications.
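
To make the size numbers above concrete, here is a minimal, self-contained sketch of symmetric group-wise 4-bit weight quantization in NumPy. It illustrates the general idea behind low-bit weight formats, not Meta's actual QAT-with-LoRA or SpinQuant recipes; the group size and scale dtype are assumptions.

```python
import numpy as np

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit group-wise quantization of a weight matrix.

    Each group of `group_size` values shares one FP16 scale, so the payload
    is ~4 bits per weight plus a small overhead for the scales.
    """
    flat = weights.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)

q, scales = quantize_int4(w)
w_hat = dequantize_int4(q, scales).reshape(w.shape)

bf16_bytes = w.size * 2                     # BF16: 2 bytes per weight
int4_bytes = w.size // 2 + scales.size * 2  # 4-bit payload + FP16 scales
print(f"size vs BF16: {int4_bytes / bf16_bytes:.2f}")
print(f"mean abs quantization error: {np.abs(w - w_hat).mean():.5f}")
```

The reported 56% whole-model reduction is smaller than the per-matrix ratio above because embeddings and some layers are kept at higher precision.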

AI: What people are saying
The comments on Meta's launch of lightweight quantized Llama models reveal several key themes and concerns from the community.
  • Users express curiosity about the specific VRAM requirements and model sizes for running the new models effectively.
  • There are discussions on the performance of the new SpinQuant method compared to existing quantization techniques, with some users sharing personal experiences.
  • Several commenters seek practical advice on using the models in production, including fine-tuning and integration into iOS applications.
  • Concerns are raised about the accuracy and usability of the models in real-world tasks, particularly in structured data handling.
  • Some users question the transparency of Meta's claims regarding open sourcing and the actual availability of training data.
20 comments
By @tveita - 6 months
So SpinQuant learns a rotation for activations and weights that, to my understanding, "smears" the outlier weights out so you don't get extreme values in any one weight.

Random anecdote warning - In the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search in a decent amount of high-dimensional vectors.

I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.

Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.

As it turns out the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf
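
To illustrate the "random rotation then binarize" baseline mentioned above, here is a rough NumPy sketch (not the ITQ algorithm from the paper; the data and dimensions are invented):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: 10k vectors in 128 dims with a few dominant "outlier" dimensions,
# loosely analogous to outlier channels in weights/activations.
X = rng.normal(size=(10_000, 128))
X[:, :4] *= 20.0

# Random orthogonal rotation from the QR decomposition of a Gaussian matrix.
rotation, _ = np.linalg.qr(rng.normal(size=(128, 128)))

def to_bits(vectors):
    """1-bit quantization: keep only the sign of each coordinate."""
    return (vectors > 0).astype(np.uint8)

codes_plain = to_bits(X)               # binarize directly
codes_rotated = to_bits(X @ rotation)  # rotate first, then binarize

def hamming(a, b):
    return np.count_nonzero(a != b, axis=-1)

# How well does Hamming distance between bit codes track the true cosine distance
# from vector 0? Spreading the outlier energy across all bits should help.
cos_dist = 1 - (X @ X[0]) / (np.linalg.norm(X, axis=1) * np.linalg.norm(X[0]))
for name, codes in [("plain", codes_plain), ("rotated", codes_rotated)]:
    corr = np.corrcoef(cos_dist[1:], hamming(codes[0], codes[1:]))[0, 1]
    print(f"{name:8s} correlation with cosine distance: {corr:.2f}")
```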

By @nisten - 6 months
It's pretty interesting that the new SpinQuant method did not manage to be better than good old NF4 QLoRA training (Tim Dettmers really cooked with that one).

Really appreciate that Meta published both results+model quants and didn't just make some bs claim about a new sota quant like most other bigger companies would've done.

By @theanonymousone - 6 months
May I ask if anyone has successfully used the 1B and 3B models in production, and if yes, in what use cases? I seem to be failing even at seemingly simple tasks such as word translation or zero-shot classification. For example, they seem to ignore instructions to output only the answer and no explanation, which makes them impossible to use in a pipeline :/
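
For what it's worth, one pattern that sometimes helps with the "answer only, no explanation" problem is putting the constraint in the system message, capping max_new_tokens, and validating the output instead of trusting it. A rough sketch with Hugging Face transformers (the gated meta-llama/Llama-3.2-1B-Instruct checkpoint and the prompts are just examples; small models may still ignore the instruction):

```python
# Rough sketch of a zero-shot classification call with a strict system prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system",
     "content": "You are a classifier. Reply with exactly one word: positive, negative, or neutral."},
    {"role": "user", "content": "Classify: 'The battery life on this phone is fantastic.'"},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=5, do_sample=False)
answer = tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True).strip().lower()

# Validate instead of trusting the model to follow instructions.
assert answer in {"positive", "negative", "neutral"}, f"unexpected output: {answer!r}"
print(answer)
```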
By @formalsystem - 6 months
Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and the Arm kernels in this blog post. If you have any questions about quantization or performance more generally, feel free to let me know!
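
Not speaking for Mark, but for reference, weight-only int4 post-training quantization with torchao looks roughly like the sketch below (API names per recent torchao releases; the QAT + LoRA flow used for the official quantized checkpoints is a separate, more involved recipe, and the model id here is just an example):

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int4_weight_only

# Load the BF16 model, then swap Linear weights for packed int4 weights in place.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
quantize_(model, int4_weight_only(group_size=32))  # weight-only PTQ, group size 32
```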
By @philipkglass - 6 months
These quantized models show much less degradation compared to "vanilla post-training quantization", but there are a bunch of PTQ schemes that people have already applied to Llama models [1]. I didn't see any details about the vanilla PTQ they used as a baseline. Has it been written about elsewhere?

[1] https://ollama.com/library/llama3.2/tags

By @yuvalr1 - 6 months
Looking at how to deploy the 1B and 3B Llama models on Android for inference. Some posts online recommend using Termux (an amazing app) to get a Linux-like shell and then installing things as if it were Linux, for example with ollama. However, this forces you into a manual installation process, and most people don't know what Termux is and would be afraid to install it from F-Droid.

Maybe someone can recommend a way to deploy Llama to Android without Termux, maybe even something that can be potentially fully implemented inside an app?

I'm currently looking into compiling llama.cpp for Android and bundling it inside an app. Is that a viable path? Would love to hear from someone who tried something similar.

By @cmsj - 6 months
It really bugs me that every time I see posts about new models, there is never any indication of how much VRAM one needs to actually run them.
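
As a rough rule of thumb (my numbers, not Meta's): weight memory ≈ parameter count × bits per weight / 8, plus KV cache and runtime overhead. A tiny sketch, with approximate parameter counts:

```python
# Back-of-the-envelope weight-memory estimate; ignores KV cache and runtime overhead.
# Parameter counts are approximate (Llama 3.2 1B ~1.24B params, 3B ~3.21B params).
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("Llama 3.2 1B", 1.24), ("Llama 3.2 3B", 3.21)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits:>2}-bit weights: ~{weight_gb(params, bits):.1f} GB")
```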
By @ed - 6 months
Oh cool! I’ve been playing with quantized Llama 3B for the last week (4-bit SpinQuant). The code for SpinQuant has been public for a bit.

It’s pretty adept at most natural language tasks (“summarize this”) and performance on iPhone is usable. It’s even decent at tool calling once you get the chat template right.

But it struggles with JSON and HTML syntax (correctly escaping characters), and isn’t great at planning, which makes it a bad fit for most agentic uses.

My plan was to let Llama communicate with more advanced AIs, using natural language to offload tool use to them, but very quickly Llama goes rogue and starts doing things you didn’t ask it to, like trying to delete data.

Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.

By @Evidlo - 6 months
Why don't they actually say what the size of the model is in GB?

That and average inference times on common hardware is what I'm curious about.

By @itsTyrion - 5 months
Wait, so I can get incorrect information and text summaries with things added or cut off even faster and on mobile now? That's amazing.
By @nikolayasdf123 - 6 months
What's your opinion on LlamaStack?

For me it has been nothing short of a bad experience: it's way over-engineered, the quality is poor, it just plain does not work, and the maintainers are questionable. I would rather call HuggingFace Python code for inference, or anything else.

Is ExecuTorch any better?

By @Tepix - 6 months
From TFA:

> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B

No, you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!

By @justanotheratom - 6 months
Any pointers on how to fine-tune this on my dataset, then package and run it in my Swift iOS app?
By @behnamoh - 6 months
Does anyone know why the most common method to speed up inference is quantization? I keep hearing about all sorts of new methods, but nearly none of them are implemented in practice (except for FlashAttention).
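
One common answer: single-stream decoding is memory-bandwidth-bound, since every generated token has to read essentially all of the weights, so shrinking the weights by ~4x raises the ceiling on tokens per second by roughly the same factor (and low-bit kernels help compute too). A crude illustrative bound, with made-up bandwidth and size numbers:

```python
# Crude roofline-style bound for memory-bound decoding:
# tokens/sec <= memory bandwidth / bytes read per token (~ quantized model size).
def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    return bandwidth_gb_per_s / model_size_gb

BANDWIDTH = 60.0  # GB/s, ballpark for a recent flagship phone SoC (assumption)
for label, size_gb in [("3B @ BF16", 6.4), ("3B @ ~4-bit", 1.8)]:
    print(f"{label}: <= {max_tokens_per_sec(size_gb, BANDWIDTH):.0f} tok/s")
```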
By @EliBullockPapa - 6 months
Anyone know a nice iOS app to run these locally?
By @arnaudsm - 6 months
How do they compare to their original quants on ollama like q4_K_S?
By @newfocogi - 6 months
TLDR: Quantized versions of Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).