ARIA: An Open Multimodal Native Mixture-of-Experts Model
Aria is an open-source multimodal native mixture-of-experts model that activates 3.9 billion parameters per visual token and 3.5 billion per text token, outperforming comparable open models and remaining competitive with proprietary models thanks to a four-stage pre-training pipeline.
Aria is an open multimodal native mixture-of-experts model introduced by a team of researchers led by Dongxu Li. The model aims to address the limitations of proprietary multimodal AI models by providing an open-source alternative that integrates diverse modalities for comprehensive understanding. Aria activates 3.9 billion parameters per visual token and 3.5 billion per text token. It outperforms open-weight models such as Pixtral-12B and Llama3.2-11B and competes effectively with top proprietary models across various multimodal tasks. The model was pre-trained from scratch using a four-stage pipeline that progressively builds its capabilities in language understanding, multimodal comprehension, long-context processing, and instruction following. The researchers have released the model weights and a supporting codebase for public use, facilitating adoption and adaptation in real-world applications.
- Aria is an open-source multimodal AI model designed to integrate diverse information modalities.
- It features 3.9 billion and 3.5 billion activated parameters for visual and text tokens, respectively.
- The model outperforms comparable open models such as Pixtral-12B and Llama3.2-11B and is competitive with top proprietary models on various multimodal tasks.
- Aria was pre-trained using a four-stage pipeline to enhance its capabilities.
- The model weights and codebase are available for public use, promoting accessibility and adaptability (see the loading sketch after this list).
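Since the weights are publicly released, a natural way to try the model is through the Hugging Face Transformers stack. The following is a minimal sketch, not the authors' official recipe: the repository id "rhymes-ai/Aria", the trust_remote_code loading path, and the chat-template message format are assumptions about how the checkpoint is distributed and may need adjusting to the actual release.

```python
# Hypothetical sketch of loading the released Aria weights with Hugging Face Transformers.
# Repo id, remote-code loading, and chat-template layout are assumptions, not confirmed details.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hugging Face repo id for the public checkpoint

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # MoE checkpoints are large; bf16 keeps memory manageable
    device_map="auto",
    trust_remote_code=True,
)

# Interleave an image with a text question, as a multimodal-native model expects.
image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Because only a subset of experts is activated per token, inference cost is closer to a ~3.9B dense model than to the full parameter count, which is why a single-GPU bf16 setup like the one sketched above is plausible.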
Related
MIT researchers advance automated interpretability in AI models
MIT researchers developed MAIA, an automated system enhancing AI model interpretability, particularly in vision systems. It generates hypotheses, conducts experiments, and identifies biases, improving understanding and safety in AI applications.
Llama 3 Secrets Every Engineer Must Know
Llama 3 is an advanced open-source language model trained on 15 trillion multilingual tokens, featuring 405 billion parameters, improved reasoning, and multilingual capabilities, while exploring practical applications and limitations.
Mistral releases Pixtral 12B, its first multimodal model
French AI startup Mistral launched Pixtral 12B, a multimodal model for processing images and text, featuring 12 billion parameters. The model is available under an Apache 2.0 license.
Pixtral 12B
Mistral AI released Pixtral 12B, its first multimodal model for image and text processing, achieving 52.5% on the MMMU benchmark and supporting variable image sizes in a 128K token context.
Meta Llama 3 vision multimodal models – how to use them and what they can do
Meta's Llama 3 model now supports multimodal inputs, allowing image and text processing. While it excels in image recognition and sentiment analysis, it shows significant limitations in reasoning and visual data interpretation.
Cool, though it maybe needs a better name for SEO. ARIA already has a meaning in web apps.