October 24th, 2024

Quantized Llama models with increased speed and a reduced memory footprint

Meta has launched lightweight quantized Llama models for mobile devices, achieving a 2-4x speedup and a 56% reduction in model size. These models target short-context applications (up to 8K tokens) and keep data processing on-device for privacy.

Meta has introduced lightweight quantized Llama models designed for enhanced performance on mobile devices. The quantized Llama 3.2 1B and 3B models achieve a 2-4x speedup, a 56% reduction in model size, and a 41% decrease in memory usage compared to the original BF16 format.

The quantization relies on two techniques: Quantization-Aware Training (QAT) with LoRA adaptors, which prioritizes accuracy, and SpinQuant, a post-training quantization method that emphasizes portability. The models are optimized for deployment on Qualcomm and MediaTek SoCs with Arm CPUs, making them suitable for on-device applications. They support short-context applications up to 8K tokens and have been tested on devices such as the OnePlus 12 (Android), showing significant improvements in decode and prefill latency.

Meta aims to make development with Llama easier by providing these models to the community, allowing developers to build efficient applications that preserve privacy by processing data on-device. The company is also exploring Neural Processing Units (NPUs) for further performance gains. The Llama 3.2 models are available for download on llama.com and Hugging Face, reflecting Meta's commitment to openness and innovation in AI.

- Meta has released quantized Llama models for mobile devices, improving speed and reducing memory usage.

- The models achieve a 2-4x speedup and a 56% reduction in size compared to the original format.

- Two quantization techniques were used: QAT with LoRA for accuracy and SpinQuant for portability.

- The models are optimized for Qualcomm and MediaTek SoCs, supporting applications with short contexts.

- Llama 3.2 models are available for download, promoting community development and privacy-focused applications.
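
To make the size numbers above concrete, here is a minimal, self-contained sketch of symmetric group-wise 4-bit weight quantization in NumPy. It illustrates the general idea behind low-bit weight formats, not Meta's actual QAT-with-LoRA or SpinQuant recipes; the group size and scale dtype are assumptions.

```python
import numpy as np

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit group-wise quantization of a weight matrix.

    Each group of `group_size` values shares one FP16 scale, so the payload
    is ~4 bits per weight plus a small overhead for the scales.
    """
    flat = weights.reshape(-1, group_size)
    scales = np.abs(flat).max(axis=1, keepdims=True) / 7.0  # int4 range is [-8, 7]
    q = np.clip(np.round(flat / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def dequantize_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4096, 4096)).astype(np.float32)

q, scales = quantize_int4(w)
w_hat = dequantize_int4(q, scales).reshape(w.shape)

bf16_bytes = w.size * 2                     # BF16: 2 bytes per weight
int4_bytes = w.size // 2 + scales.size * 2  # 4-bit payload + FP16 scales
print(f"size vs BF16: {int4_bytes / bf16_bytes:.2f}")
print(f"mean abs quantization error: {np.abs(w - w_hat).mean():.5f}")
```

The reported 56% whole-model reduction is smaller than the per-matrix ratio above because embeddings and some layers are kept at higher precision.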

AI: What people are saying
The comments on Meta's launch of lightweight quantized Llama models reveal several key themes and concerns from the community.
  • Users express curiosity about the specific VRAM requirements and model sizes for running the new models effectively.
  • There are discussions on the performance of the new SpinQuant method compared to existing quantization techniques, with some users sharing personal experiences.
  • Several commenters seek practical advice on using the models in production, including fine-tuning and integration into iOS applications.
  • Concerns are raised about the accuracy and usability of the models in real-world tasks, particularly in structured data handling.
  • Some users question the transparency of Meta's claims regarding open sourcing and the actual availability of training data.
20 comments
By @tveita - 6 months
So SpinQuant learns a rotation for activations and weights that, to my understanding, "smears" the outlier weights out so you don't get extreme values in any one weight.

Random anecdote warning - In the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search in a decent amount of high-dimensional vectors.

I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.

Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.

As it turns out the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf
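
To illustrate the "random rotation then binarize" baseline mentioned above, here is a rough NumPy sketch (not the ITQ algorithm from the paper; the data and dimensions are invented):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: 10k vectors in 128 dims with a few dominant "outlier" dimensions,
# loosely analogous to outlier channels in weights/activations.
X = rng.normal(size=(10_000, 128))
X[:, :4] *= 20.0

# Random orthogonal rotation from the QR decomposition of a Gaussian matrix.
rotation, _ = np.linalg.qr(rng.normal(size=(128, 128)))

def to_bits(vectors):
    """1-bit quantization: keep only the sign of each coordinate."""
    return (vectors > 0).astype(np.uint8)

codes_plain = to_bits(X)               # binarize directly
codes_rotated = to_bits(X @ rotation)  # rotate first, then binarize

def hamming(a, b):
    return np.count_nonzero(a != b, axis=-1)

# How well does Hamming distance between bit codes track the true cosine distance
# from vector 0? Spreading the outlier energy across all bits should help.
cos_dist = 1 - (X @ X[0]) / (np.linalg.norm(X, axis=1) * np.linalg.norm(X[0]))
for name, codes in [("plain", codes_plain), ("rotated", codes_rotated)]:
    corr = np.corrcoef(cos_dist[1:], hamming(codes[0], codes[1:]))[0, 1]
    print(f"{name:8s} correlation with cosine distance: {corr:.2f}")
```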

By @nisten - 6 months
It's pretty interesting that the new SpinQuant method did not manage to be better than good old NF4 QLoRA training (Tim Dettmers really cooked with that one).

Really appreciate that Meta published both results+model quants and didn't just make some bs claim about a new sota quant like most other bigger companies would've done.

By @theanonymousone - 6 months
May I ask if anyone has successfully used the 1B and 3B models in production, and if yes, in what use cases? I seem to be failing even at seemingly simple tasks such as word translation or zero-shot classification. For example, they seem to ignore instructions to output only the answer and no explanation, which makes them impossible to use in a pipeline :/
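
For what it's worth, one pattern that sometimes helps with the "answer only, no explanation" problem is putting the constraint in the system message, capping max_new_tokens, and validating the output instead of trusting it. A rough sketch with Hugging Face transformers (the gated meta-llama/Llama-3.2-1B-Instruct checkpoint and the prompts are just examples; small models may still ignore the instruction):

```python
# Rough sketch of a zero-shot classification call with a strict system prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system",
     "content": "You are a classifier. Reply with exactly one word: positive, negative, or neutral."},
    {"role": "user", "content": "Classify: 'The battery life on this phone is fantastic.'"},
]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=5, do_sample=False)
answer = tok.decode(out[0, inputs.shape[-1]:], skip_special_tokens=True).strip().lower()

# Validate instead of trusting the model to follow instructions.
assert answer in {"positive", "negative", "neutral"}, f"unexpected output: {answer!r}"
print(answer)
```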
By @formalsystem - 6 months
Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and the Arm kernels in this blog post. If you have any questions about quantization or performance more generally, feel free to let me know!
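
Not speaking for Mark, but for reference, weight-only int4 post-training quantization with torchao looks roughly like the sketch below (API names per recent torchao releases; the QAT + LoRA flow used for the official quantized checkpoints is a separate, more involved recipe, and the model id here is just an example):

```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization import quantize_, int4_weight_only

# Load the BF16 model, then swap Linear weights for packed int4 weights in place.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
quantize_(model, int4_weight_only(group_size=32))  # weight-only PTQ, group size 32
```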
By @philipkglass - 6 months
These quantized models show much less degradation compared to "vanilla post-training quantization", but there are a bunch of PTQ schemes that people have already applied to Llama models [1]. I didn't see any details about the vanilla PTQ they used as a baseline. Has it been written about elsewhere?

[1] https://ollama.com/library/llama3.2/tags

By @yuvalr1 - 6 months
Looking at how to deploy the 1B and 3B Llama models on Android for inference. Some posts online recommend using Termux (an amazing app) to get a Linux-like shell and then installing things as if it were Linux, for example with ollama. However, this forces you into a manual installation process, and most people don't know what Termux is and would be afraid to install it from F-Droid.

Maybe someone can recommend a way to deploy Llama to Android without Termux, maybe even something that can be potentially fully implemented inside an app?

I'm currently looking into compiling llama.cpp for Android and bundling it inside an app. Is that a viable path? Would love to hear from someone who tried something similar.

By @cmsj - 6 months
It really bugs me that every time I see posts about new models, there is never any indication of how much VRAM one needs to actually run them.
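
As a rough rule of thumb (my numbers, not Meta's): weight memory ≈ parameter count × bits per weight / 8, plus KV cache and runtime overhead. A tiny sketch, with approximate parameter counts:

```python
# Back-of-the-envelope weight-memory estimate; ignores KV cache and runtime overhead.
# Parameter counts are approximate (Llama 3.2 1B ~1.24B params, 3B ~3.21B params).
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for name, params in [("Llama 3.2 1B", 1.24), ("Llama 3.2 3B", 3.21)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits:>2}-bit weights: ~{weight_gb(params, bits):.1f} GB")
```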
By @ed - 6 months
Oh cool! I’ve been playing with quantized Llama 3B for the last week (4-bit SpinQuant). The code for SpinQuant has been public for a bit.

It’s pretty adept at most natural language tasks (“summarize this”) and performance on iPhone is usable. It’s even decent at tool calling once you get the chat template right.

But it struggles with JSON and HTML syntax (correctly escaping characters), and isn’t great at planning, which makes it a bad fit for most agentic uses.

My plan was to let Llama communicate with more advanced AIs, using natural language to offload tool use to them, but very quickly Llama goes rogue and starts doing things you didn’t ask it to, like trying to delete data.

Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.

By @Evidlo - 6 months
Why don't they actually say what the size of the model is in GB?

That and average inference times on common hardware is what I'm curious about.

By @itsTyrion - 5 months
Wait, so I can get incorrect information and text summaries with things added or cut off even faster and on mobile now? That's amazing.
By @nikolayasdf123 - 6 months
What's your opinion on LlamaStack?

For me it has been nothing short of a bad experience: it's way over-engineered, the quality is poor, it just plain does not work, and the maintainers are questionable. I would rather call HuggingFace Python code for inference, or anything else.

Is ExecuTorch any better?

By @Tepix - 6 months
From TFA:

> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B

No, you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!

By @justanotheratom - 6 months
Any pointers on how to fine-tune this on my dataset, then package and run it in my Swift iOS app?
By @behnamoh - 6 months
Does anyone know why the most common method to speed up inference is quantization? I keep hearing about all sorts of new methods, but nearly none of them are implemented in practice (except for FlashAttention).
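
One common answer: single-stream decoding is memory-bandwidth-bound, since every generated token has to read essentially all of the weights, so shrinking the weights by ~4x raises the ceiling on tokens per second by roughly the same factor (and low-bit kernels help compute too). A crude illustrative bound, with made-up bandwidth and size numbers:

```python
# Crude roofline-style bound for memory-bound decoding:
# tokens/sec <= memory bandwidth / bytes read per token (~ quantized model size).
def max_tokens_per_sec(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    return bandwidth_gb_per_s / model_size_gb

BANDWIDTH = 60.0  # GB/s, ballpark for a recent flagship phone SoC (assumption)
for label, size_gb in [("3B @ BF16", 6.4), ("3B @ ~4-bit", 1.8)]:
    print(f"{label}: <= {max_tokens_per_sec(size_gb, BANDWIDTH):.0f} tok/s")
```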
By @EliBullockPapa - 6 months
Anyone know a nice iOS app to run these locally?
By @arnaudsm - 6 months
How do they compare to their original quants on ollama like q4_K_S?
By @newfocogi - 6 months
TLDR: Quantized versions of Llama 3.2 1B and 3B models with "competitive accuracy" to the original versions (meaning some degraded performance; plots included in the release notes).