Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
Meta released Llama 3.2, featuring vision models with 11B and 90B parameters, and lightweight text models with 1B and 3B parameters, optimized for edge devices and supporting extensive deployment options.
Meta has announced the release of Llama 3.2, which introduces small and medium-sized vision large language models (LLMs) with 11B and 90B parameters, as well as lightweight text-only models with 1B and 3B parameters. The lightweight models are designed for edge and mobile devices, support a context length of 128K tokens, and are optimized for Qualcomm and MediaTek hardware. The vision models are capable of advanced image understanding tasks, such as document-level comprehension and visual grounding, while the lightweight models excel at multilingual text generation and tool calling. Llama 3.2 models are available for download on llama.com and Hugging Face, and are supported on partner platforms including AWS and Google Cloud. The release also includes the Llama Stack, which simplifies deployment of these models across different environments. Meta emphasizes the importance of openness and collaboration in driving innovation, and positions the new models as competitive alternatives to closed models. The Llama 3.2 models have undergone extensive evaluation, demonstrating strong performance in image recognition and visual reasoning tasks. The release aims to empower developers with accessible tools for building applications that prioritize privacy and efficiency.
- Llama 3.2 includes vision models (11B, 90B) and lightweight text models (1B, 3B) for edge devices.
- Models support a context length of 128K tokens and are optimized for Qualcomm and MediaTek hardware.
- Llama Stack simplifies deployment across various environments, enhancing developer accessibility.
- The models demonstrate competitive performance in image understanding and multilingual text generation.
- Meta emphasizes openness and collaboration to foster innovation in AI development.
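For developers who want to poke at the lightweight models directly, here is a minimal local-inference sketch using Hugging Face transformers. It follows standard transformers conventions; treat the exact repo name and access requirements as assumptions and check the model card (the weights are gated).

  # Minimal sketch: running a lightweight Llama 3.2 instruct model with transformers.
  # The model ID below is an assumption; requires accepting the license on Hugging Face.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed repo name
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs accelerate

  messages = [{"role": "user", "content": "Summarize the key points of Llama 3.2."}]
  inputs = tokenizer.apply_chat_template(
      messages, add_generation_prompt=True, return_tensors="pt"
  ).to(model.device)
  outputs = model.generate(inputs, max_new_tokens=200)
  print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))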
Related
Llama 3.1 Official Launch
Llama introduces Llama 3.1, an open-source AI model available in 8B, 70B, and 405B versions. The 405B model is highlighted for its versatility in supporting various use cases, including multi-lingual agents and analyzing large documents. Users can leverage coding assistants, real-time or batch inference, and fine-tuning capabilities. Llama emphasizes open-source AI and offers subscribers updates via a newsletter.
Llama 3.1: Our most capable models to date
Meta has launched Llama 3.1 405B, an advanced open-source AI model supporting diverse languages and extended context length. It introduces new features like Llama Guard 3 and aims to enhance AI applications with improved models and partnerships.
Meta Llama 3.1 405B
The Meta AI team unveils Llama 3.1, a 405B model optimized for dialogue applications. It competes well with GPT-4o and Claude 3.5 Sonnet, offering versatility and strong performance in evaluations.
An update on Llama adoption
Llama, Meta's language model, has surpassed 350 million downloads, with significant growth in usage and adoption among major companies, driven by its open-source nature and recent enhancements.
Llama 3.2 released: Multimodal, 1B to 90B sizes
Llama 3.2 has been released as an open-source AI model in various sizes for text and image processing, enhancing application development and gaining significant traction with over 350 million downloads.
- Users are impressed with the performance of the smaller models (1B and 3B), noting their efficiency and ability to handle complex tasks.
- There are mixed reviews regarding the larger vision models (11B and 90B), with some users finding them less effective compared to competitors.
- Many users appreciate Meta's openness in sharing model details and deployment options, fostering a collaborative environment.
- Concerns about accessibility and performance on various devices, particularly for the larger models, are frequently mentioned.
- Some users express curiosity about the models' capabilities in specific applications, such as code assistance and multilingual support.
I tried running a full codebase through it (since it can handle 128,000 tokens) and asking it to summarize the code - it did a surprisingly decent job, incomplete but still unbelievable for a model that tiny: https://gist.github.com/simonw/64c5f5b111fe473999144932bef42...
More of my notes here: https://simonwillison.net/2024/Sep/25/llama-32/
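A rough sketch of that kind of long-context experiment, assuming a local Ollama server exposing a llama3.2 tag; the project path, model tag, and prompt below are placeholders:

  # Rough sketch: concatenate a small codebase and ask a local model to summarize it.
  # Assumes an Ollama server on localhost:11434; adjust the model tag and path to taste.
  import pathlib, requests

  code = "\n\n".join(
      f"# FILE: {p}\n{p.read_text(errors='ignore')}"
      for p in pathlib.Path("my_project").rglob("*.py")  # placeholder project path
  )

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={
          "model": "llama3.2",
          "prompt": f"Summarize what this codebase does:\n\n{code}",
          "stream": False,
      },
  )
  print(resp.json()["response"])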
I've been trying out the larger image models using the versions hosted on https://lmarena.ai/ - navigate to "Direct Chat" and you can select them from the dropdown and upload images to run prompts.
With 1-hot encoding, the answer is "wall", with 100% probability. Oh, you gave plausibility to "fence" too? WRONG! ENJOY MORE PENALTY, SCRUB!
I believe this unforgiving dynamic is why model distillation works well. The original teacher model had to learn via the "hot or cold" game on text answers. But when the child instead imitates the teacher's predictions, it learns semantically rich answers. That strikes me as vastly more compute-efficient. So to me, it makes sense why these Llama 3.2 edge models punch so far above their weight(s). But it still blows my mind thinking how far models have advanced from a year or two ago. Kudos to Meta for these releases.
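A minimal sketch of the soft-target distillation loss that comment is describing; the temperature and loss weighting are standard textbook choices, not anything specific to Llama 3.2:

  # Sketch of distillation on soft targets: instead of one-hot cross-entropy against
  # the single "correct" next token, the student matches the teacher's full probability
  # distribution, so plausible alternatives like "fence" still carry learning signal.
  import torch
  import torch.nn.functional as F

  def distillation_loss(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
      # Soft part: KL divergence between temperature-softened teacher and student.
      soft = F.kl_div(
          F.log_softmax(student_logits / T, dim=-1),
          F.softmax(teacher_logits / T, dim=-1),
          reduction="batchmean",
      ) * (T * T)
      # Hard part: ordinary one-hot cross-entropy against the ground-truth token.
      hard = F.cross_entropy(student_logits, hard_labels)
      return alpha * soft + (1 - alpha) * hard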
For anyone looking for a simple way to test Llama 3.2 3B locally with a UI, install nexa-sdk (https://github.com/NexaAI/nexa-sdk) and type in the terminal:
nexa run llama3.2 --streamlit
Disclaimer: I am from Nexa AI and nexa-sdk is open source. We'd love your feedback.
- The 1B is extremely coherent (feels something like maybe Mistral 7B at 4 bits), and with flash attention and 4 bit KV cache it only uses about 4.2 GB of VRAM for 128k context
- A Pi 5 runs the 1B at 8.4 tok/s; haven't tested the 3B yet, but it might need a lower quant to fit, and with 9T training tokens it'll probably degrade pretty badly
- The 3B is a certified Gemma-2-2B killer
Given that llama.cpp doesn't support any multimodality (they removed the old implementation), it might be a while before the 11B and 90B become runnable. Doesn't seem like they outperform Qwen-2-VL at vision benchmarks though.
It's super fast with a lot of knowledge, a large context and great understanding. Really impressive model.
I just removed my install of 3.1-8b.
my ollama list is currently:
$ ollama list
NAME ID SIZE MODIFIED
llama3.2:3b-instruct-q8_0 e410b836fe61 3.4 GB 2 hours ago
gemma2:9b-instruct-q4_1 5bfc4cf059e2 6.0 GB 3 days ago
phi3.5:3.8b-mini-instruct-q8_0 8b50e8e1e216 4.1 GB 3 days ago
mxbai-embed-large:latest 468836162de7 669 MB 3 months ago
It gets "which is larger: 9.11 or 9.9?" right if it manages to mention that decimals need to be compared first in its step-by-step thinking. If it skips mentioning decimals, then it says 9.11 is larger.
It gets the strawberry question wrong even after enumerating all the letters correctly, probably because it can't properly count.
The 7/8B models are great for PoCs and for moving minor use cases to the edge … but there's a big, empty gap until 70B, which most people can't run.
The tinfoil hat in me says this is the compromise the powers that be have agreed to: being "open" in name but practically gimped for the average joe techie. Basically arms control.
Livebench and Lmsys are weeks behind and sometimes refuse to add some major models. And press releases like this cherry pick their benchmarks and ignore better models like qwen2.5.
If it doesn't exist I'm willing to create it
Do they require GPU or can they be deployed on VPS with dedicated CPU?
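The lightweight models are generally runnable CPU-only via quantized GGUF builds; here is a sketch with llama-cpp-python, where the GGUF filename is a placeholder and throughput will depend heavily on quantization level and core count:

  # CPU-only sketch with llama-cpp-python: n_gpu_layers=0 keeps everything on the CPU.
  from llama_cpp import Llama

  llm = Llama(
      model_path="./Llama-3.2-3B-Instruct-Q4_K_M.gguf",  # placeholder filename
      n_ctx=8192,
      n_gpu_layers=0,   # no GPU offload
      n_threads=8,
  )
  out = llm.create_chat_completion(
      messages=[{"role": "user", "content": "Which is larger: 9.11 or 9.9?"}],
      max_tokens=128,
  )
  print(out["choices"][0]["message"]["content"])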
I'd like to see more head-to-head comparisons with community created multi-modal LLMs as done in these papers:
https://arxiv.org/abs/2408.05334
https://arxiv.org/abs/2408.03326
I look forward to reading the technical report, once it's available. I couldn't find a link to one, yet.
But this is relevant because I'm curious how this new model accepts image inputs. Do you paste a base64 image into the prompt?
It feels like these models can start to provide not only the text-generation backend but also replace the API infrastructure around it.
Can you input images without something in front of it like openwebui?
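For local runtimes the image typically goes in a separate base64-encoded field rather than being pasted into the prompt text; here is a sketch against an Ollama-style /api/chat endpoint, where the vision model tag is an assumption about what your runtime actually serves:

  # Sketch: sending an image alongside a prompt to an Ollama-style chat API.
  # The image is passed as a base64 string in an "images" field, not inline in the prompt.
  import base64, requests

  with open("photo.jpg", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode()

  resp = requests.post(
      "http://localhost:11434/api/chat",
      json={
          "model": "llama3.2-vision",  # placeholder tag
          "messages": [
              {"role": "user", "content": "Describe this image.", "images": [image_b64]}
          ],
          "stream": False,
      },
  )
  print(resp.json()["message"]["content"])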
Next I want to try to use Aider with it and see how this would work.
Literally everything about the quote below is wrong. The problem with these "amazing ground-breaking models" is that you never know which knowledge they actually possess.
--- start quote ---
>>> what is the last version of Ash Framework for Elixir you know
My training data goes up to December 2023, and according to the Ash Framework releases, the last stable version that I'm familiar with is:
* *Ash 2.14*: Released on April 8, 2021.
However, it's worth noting that Ash has been merged into Elixir itself, and most of its features have been integrated into the Elixir language and standard libraries.
In fact, Ash was one of the first frameworks to be included in Elixir, and it played a significant role in shaping the language. Many of Ash's core concepts, such as `defstruct/1` and `defmodule/1`, were later adopted by Elixir itself.
...
Keep in mind that Ash is no longer actively maintained as a separate framework, but its legacy continues to influence Elixir and its community.
--- end quote ---
TL;DR:
- 90B-Vision: 4.3% hallucination rate
- 11B-Vision: 5.5% hallucination rate
> Llama 3.2 Multimodal is not available in your region.
It sounds like they input the continuous output of an image encoder into a transformer, similar to transfusion[0]? Does someone know where to find more details?
Edit:
> Regarding the licensing terms, Llama 3.2 comes with a very similar license to Llama 3.1, with one key difference in the acceptable use policy: any individual domiciled in, or a company with a principal place of business in, the European Union is not being granted the license rights to use multimodal models included in Llama 3.2. [1]
What a bummer.
0. https://www.arxiv.org/abs/2408.11039
1. https://huggingface.co/blog/llama32#llama-32-license-changes...
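On the architecture question above: Meta's announcement describes the vision models as a pre-trained image encoder attached to the language model through adapter layers built around cross-attention. A purely illustrative sketch of that general idea follows; the dimensions and wiring are made up for illustration and are not Meta's actual implementation:

  # Illustrative sketch of a cross-attention adapter: text hidden states attend to
  # projected image-encoder features, with a residual connection back into the text stream.
  import torch
  import torch.nn as nn

  class CrossAttentionAdapter(nn.Module):
      def __init__(self, text_dim=4096, image_dim=1280, n_heads=32):
          super().__init__()
          self.proj = nn.Linear(image_dim, text_dim)   # map image features into text space
          self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
          self.norm = nn.LayerNorm(text_dim)

      def forward(self, text_hidden, image_features):
          # text_hidden: (batch, seq_len, text_dim); image_features: (batch, patches, image_dim)
          img = self.proj(image_features)
          attended, _ = self.attn(query=text_hidden, key=img, value=img)
          return self.norm(text_hidden + attended)     # residual connection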
If there's an algorithmic penalty against the news for whatever reason, that may be a flaw in the HN ranking algorithm.
- The 11B and 90B vision models are competitive with leading closed models like Claude 3 Haiku on image understanding tasks, while being open and customizable.
- Llama 3.2 comes with official Llama Stack distributions to simplify deployment across environments (cloud, on-prem, edge), including support for RAG and safety features.
- The lightweight 1B and 3B models are optimized for on-device use cases like summarization and instruction following.
He's hoping to control AI as the next platform through which users interact with apps. Free AI is then fine if the surplus value created by not having a gatekeeper to his apps exceeds the cost of the free AI.
That's the strategy. No values here - just strategy folks.
Could someone try giving the 90b model this word search problem [0] and tell me how it performs? So far with every model I've tried, none has ever managed to find a single word correctly.