Quantized Llama models with increased speed and a reduced memory footprint
Meta has launched lightweight quantized Llama models for mobile devices, achieving 2-4x speedup and 56% size reduction. These models support short-context applications and prioritize on-device data processing for privacy.
Meta has introduced lightweight quantized Llama models designed for enhanced performance on mobile devices. These models, Llama 3.2 1B and 3B, achieve a 2-4x speedup, a 56% reduction in model size, and a 41% decrease in memory usage compared to the original BF16 format. The quantization process used two techniques: Quantization-Aware Training (QAT) with LoRA adaptors, which prioritizes accuracy, and SpinQuant, a post-training quantization method that prioritizes portability. The models are optimized for deployment on Qualcomm and MediaTek SoCs with Arm CPUs, making them suitable for on-device applications. They support short-context applications up to 8K tokens and have been tested on devices such as the OnePlus 12 (Android), showing significant improvements in decode and prefill latency. Meta aims to ease development with Llama by providing these models to the community, allowing developers to build efficient applications that protect privacy by processing data on-device. The company is also exploring Neural Processing Units (NPUs) for further performance gains. The Llama 3.2 models are available for download on llama.com and Hugging Face, reflecting Meta's commitment to openness and innovation in AI.
- Meta has released quantized Llama models for mobile devices, improving speed and reducing memory usage.
- The models achieve a 2-4x speedup and a 56% reduction in size compared to the original format.
- Two quantization techniques were used: QAT with LoRA for accuracy and SpinQuant for portability (see the sketch after this list).
- The models are optimized for Qualcomm and MediaTek SoCs, supporting applications with short contexts.
- Llama 3.2 models are available for download, promoting community development and privacy-focused applications.
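To make the quantization schemes above concrete, here is a minimal NumPy sketch of symmetric group-wise int4 weight quantization. The group size of 32 and the rounding details are illustrative assumptions, not necessarily Meta's exact recipe; both QAT and SpinQuant ultimately produce low-bit weights in a format along these lines.

    import numpy as np

    def quantize_int4_groupwise(w: np.ndarray, group_size: int = 32):
        # Split each row into groups; each group gets its own scale so
        # outliers only distort their local neighbourhood.
        rows, cols = w.shape
        assert cols % group_size == 0
        groups = w.reshape(rows, cols // group_size, group_size)
        # Symmetric int4 range is [-8, 7]; one float scale per group.
        scales = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
        scales = np.where(scales == 0.0, 1.0, scales)  # avoid div by zero
        q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
        return q, scales

    def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        deq = q.astype(np.float32) * scales
        return deq.reshape(deq.shape[0], -1)

    w = np.random.randn(4, 128).astype(np.float32)
    q, s = quantize_int4_groupwise(w)
    print("mean abs error:", np.abs(w - dequantize(q, s)).mean())

Finer groups mean more scale overhead but less error per group; the trade-off between group size, accuracy, and memory is exactly what techniques like QAT and SpinQuant are tuning around.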
Related
Llama 3.1: Our most capable models to date
Meta has launched Llama 3.1 405B, an advanced open-source AI model supporting diverse languages and extended context length. It introduces new features like Llama Guard 3 and aims to enhance AI applications with improved models and partnerships.
An update on Llama adoption
Llama, Meta's language model, has surpassed 350 million downloads, with significant growth in usage and adoption among major companies, driven by its open-source nature and recent enhancements.
Llama 3.2 released: Multimodal, 1B to 90B sizes
Llama 3.2 has been released as an open-source AI model in various sizes for text and image processing, enhancing application development and gaining significant traction with over 350 million downloads.
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
Meta released Llama 3.2, featuring vision models with 11B and 90B parameters, and lightweight text models with 1B and 3B parameters, optimized for edge devices and supporting extensive deployment options.
Llama can now see and run on your device – welcome Llama 3.2
Meta has released Llama 3.2 with multimodal capabilities, smaller models for on-device use, and licensing restrictions for EU users. It supports multiple languages and integrates with Hugging Face Transformers.
- Users express curiosity about the specific VRAM requirements and model sizes for running the new models effectively.
- There are discussions on the performance of the new SpinQuant method compared to existing quantization techniques, with some users sharing personal experiences.
- Several commenters seek practical advice on using the models in production, including fine-tuning and integration into iOS applications.
- Concerns are raised about the accuracy and usability of the models in real-world tasks, particularly in structured data handling.
- Some users question the transparency of Meta's claims regarding open sourcing and the actual availability of training data.
Random anecdote warning - in the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest-neighbour search over a decent number of high-dimensional vectors.
I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.
Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.
As it turns out, the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work when "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.
[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf
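For the curious, a minimal sketch of that pipeline (random rotation via QR, one sign bit per coordinate, then a linear Hamming-distance scan for candidates), with made-up data sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    dim, n = 128, 100_000

    # Hypothetical dataset of unit-normalized embedding vectors.
    data = rng.standard_normal((n, dim)).astype(np.float32)
    data /= np.linalg.norm(data, axis=1, keepdims=True)

    # A random rotation: orthonormal matrix from the QR decomposition
    # of a Gaussian matrix (the 'random rotation' baseline in ITQ).
    q_mat, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

    def to_bits(x: np.ndarray) -> np.ndarray:
        # Rotate, then quantize each coordinate to one sign bit.
        return np.packbits((x @ q_mat) > 0, axis=-1)

    codes = to_bits(data)  # n x (dim/8) bytes, scanned linearly

    def candidates(query: np.ndarray, k: int = 100) -> np.ndarray:
        # Hamming-distance scan over packed codes; the k nearest codes
        # form a candidate set for exact re-ranking afterwards.
        qc = to_bits(query[None, :])
        dists = np.unpackbits(codes ^ qc, axis=-1).sum(axis=-1)
        return np.argsort(dists)[:k]

    query = data[42]
    print(candidates(query)[:10])  # index 42 should rank first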
Really appreciate that Meta published both the results and the quantized models, and didn't just make some bs claim about a new SOTA quant like most other big companies would've done.
Maybe someone can recommend a way to deploy Llama to Android without Termux, maybe even something that can be potentially fully implemented inside an app?
I'm currently looking into compiling llama.cpp for Android and bundling it inside an app. Is that a viable path? Would love to hear from someone who tried something similar.
It's pretty adept at most natural language tasks ("summarize this") and performance on iPhone is usable. It's even decent at tool use once you get the chat template right.
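(For reference, a rough sketch of that template as used by the Llama 3.x instruct models; the special tokens below match the published format, but verify them against the tokenizer config shipped with whichever build you deploy.)

    # Builds a single-turn Llama 3.x instruct prompt string.
    def build_prompt(system: str, user: str) -> str:
        return (
            "<|begin_of_text|>"
            "<|start_header_id|>system<|end_header_id|>\n\n"
            f"{system}<|eot_id|>"
            "<|start_header_id|>user<|end_header_id|>\n\n"
            f"{user}<|eot_id|>"
            "<|start_header_id|>assistant<|end_header_id|>\n\n"
        )

    print(build_prompt("You are a concise summarizer.",
                       "Summarize this: ..."))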
But it struggles with JSON and HTML syntax (correctly escaping characters), and isn't great at planning, which makes it a bad fit for most agentic uses.
My plan was to let Llama communicate with more advanced AIs, using natural language to offload tool use to them, but very quickly Llama goes rogue and starts doing things you didn't ask it to, like trying to delete data.
Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.
That and average inference times on common hardware are what I'm curious about.
For me it is nothing short of a bad experience. It is way over-engineered, with poor quality, and it just plain does not work, and the maintainers are questionable. I would rather call HuggingFace Python code for inference, or anything else.
Is ExecuTorch any better?
> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B
No, you did not. There is no source (in this case: the training data) included. Stop changing the meaning of "open source", Meta!