October 11th, 2024

ARIA: An Open Multimodal Native Mixture-of-Experts Model

Aria is an open-source multimodal mixture-of-experts model that activates 3.9 billion parameters per visual token and 3.5 billion per text token; it outperforms open models such as Pixtral-12B and Llama3.2-11B, is competitive with proprietary models, and builds its capabilities through a four-stage pre-training pipeline.

Aria is an open multimodal native mixture-of-experts model introduced by a team of researchers led by Dongxu Li. The model aims to address the limitations of proprietary multimodal AI models by providing an open-source alternative that integrates diverse modalities for comprehensive understanding. Aria activates 3.9 billion parameters per visual token and 3.5 billion per text token. It outperforms open models such as Pixtral-12B and Llama3.2-11B, and competes effectively with top proprietary models across various multimodal tasks. The model was pre-trained from scratch using a four-stage pipeline that progressively builds its capabilities in language understanding, multimodal comprehension, long-context processing, and instruction following. The researchers have released the model weights and a supporting codebase for public use, facilitating easy adoption and adaptation in real-world applications.

- Aria is an open-source multimodal AI model designed to integrate diverse information modalities.

- It features 3.9 billion and 3.5 billion activated parameters for visual and text tokens, respectively.

- The model outperforms open models such as Pixtral-12B and Llama3.2-11B, and is competitive with proprietary models across various multimodal tasks.

- Aria was pre-trained using a four-stage pipeline to enhance its capabilities.

- The model weights and codebase are available for public use, promoting accessibility and adaptability (a loading sketch follows this list).
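
Because the weights and codebase are public, the model can presumably be loaded through standard Hugging Face interfaces. The sketch below is a minimal, unofficial example: the Hub identifier rhymes-ai/Aria, the AutoModelForCausalLM/AutoProcessor loading path with trust_remote_code, the chat-template content schema, and the image path are all assumptions; consult the released codebase for the exact prompt format and processor arguments.

```python
# Minimal, unofficial loading sketch. The Hub id, chat-template schema,
# and image path are assumptions; see the official Aria codebase for
# the exact usage.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hugging Face Hub identifier

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~2 bytes per parameter
    device_map="auto",
    trust_remote_code=True,      # MoE + vision layers ship as custom modeling code
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Build an image + text prompt via the processor's chat template
# (the exact content schema may differ in the official release).
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=Image.open("example.jpg"), return_tensors="pt").to(model.device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
print(processor.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```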

6 comments
By @cantSpellSober - 4 months
> outperforms Pixtral-12B and Llama3.2-11B

Cool, maybe needs a better name for SEO though. ARIA already has a meaning in web apps.

By @theanonymousone - 4 months
In an MoE model such as this, are all "parts" loaded in memory at the same time, or is only one part loaded at any given time? For example, does Mixtral-8x7B have the memory requirement of a 7B model, or of a 56B model?
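
For context, a back-of-the-envelope sketch of the accounting (the Mixtral-8x7B figures are approximate and the helper function is purely illustrative): in the usual serving setup every expert stays resident in memory, and the router only decides which experts a given token's forward pass touches, so the memory requirement tracks the total parameter count while per-token compute tracks the activated count.

```python
def moe_resident_gib(total_params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough resident-memory estimate for an MoE checkpoint (bf16 by default).

    All experts are normally kept loaded; the router only reduces how many
    weights each token's forward pass touches, not the checkpoint footprint.
    """
    return total_params_billions * 1e9 * bytes_per_param / 2**30


# Approximate Mixtral-8x7B figures: ~47B total parameters, ~13B activated per token.
print(f"Resident weights (bf16):   ~{moe_resident_gib(47):.0f} GiB")
print(f"Weights touched per token: ~{moe_resident_gib(13):.0f} GiB")
```

So Mixtral-8x7B sits much closer to the 56B end than the 7B end (its actual total is around 47B because attention and embeddings are shared across experts), and by the same logic Aria's footprint is set by its full expert pool rather than by the 3.5B/3.9B parameters activated per token.
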
By @niutech - 4 months
I’m curious how it compares with the recently announced Molmo: https://molmo.org/
By @petemir - 4 months
Model should be available for testing here [0], although I tried to upload a video and got an error in Chinese, and whenever I write something it says that the API key is invalid or missing.

[0] https://rhymes.ai/

By @vessenes - 4 months
This looks worth a try. Great test results, very good example output. No way to know if it’s cherry picked / overtuned without giving it a spin, but it will go on my list. Should fit on an M2 Max at full precision.
By @SomewhatLikely - 4 months
"Here, we provide a quantifiable definition: A multimodal native model refers to a single model with strong understanding capabilities across multiple input modalities (e.g. text, code, image, video), that matches or exceeds the modality specialized models of similar capacities."