ARIA: An Open Multimodal Native Mixture-of-Experts Model
Aria is an open-source multimodal native mixture-of-experts model that activates 3.9 billion parameters per visual token and 3.5 billion per text token, outperforming comparable open models and remaining competitive with proprietary models thanks to a four-stage pre-training pipeline.
Aria is an open multimodal native mixture-of-experts model introduced by a team of researchers led by Dongxu Li. The model aims to address the limitations of proprietary multimodal AI models by providing an open-source alternative that integrates diverse modalities for comprehensive understanding. Aria activates 3.9 billion parameters per visual token and 3.5 billion per text token. It outperforms open-weight models such as Pixtral-12B and Llama3.2-11B and competes effectively with top proprietary models across various multimodal tasks. The model was pre-trained from scratch using a four-stage pipeline that progressively builds its capabilities in language understanding, multimodal comprehension, long-context processing, and instruction following. The researchers have released the model weights and a supporting codebase for public use, facilitating adoption and adaptation in real-world applications.
- Aria is an open-source multimodal AI model designed to integrate diverse information modalities.
- It features 3.9 billion and 3.5 billion activated parameters for visual and text tokens, respectively.
- The model outperforms comparable open models such as Pixtral-12B and Llama3.2-11B and is competitive with top proprietary models on various multimodal tasks.
- Aria was pre-trained using a four-stage pipeline to enhance its capabilities.
- The model weights and codebase are available for public use, promoting accessibility and adaptability (see the loading sketch after this list).
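Since the weights are publicly released, a natural way to try the model is through the Hugging Face Transformers stack. The following is a minimal sketch, not the authors' official recipe: the repository id "rhymes-ai/Aria", the trust_remote_code loading path, and the chat-template message format are assumptions about how the checkpoint is distributed and may need adjusting to the actual release.

```python
# Hypothetical sketch of loading the released Aria weights with Hugging Face Transformers.
# Repo id, remote-code loading, and chat-template layout are assumptions, not confirmed details.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "rhymes-ai/Aria"  # assumed Hugging Face repo id for the public checkpoint

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # MoE checkpoints are large; bf16 keeps memory manageable
    device_map="auto",
    trust_remote_code=True,
)

# Interleave an image with a text question, as a multimodal-native model expects.
image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Because only a subset of experts is activated per token, inference cost is closer to a ~3.9B dense model than to the full parameter count, which is why a single-GPU bf16 setup like the one sketched above is plausible.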
Related
MIT researchers advance automated interpretability in AI models
MIT researchers developed MAIA, an automated system enhancing AI model interpretability, particularly in vision systems. It generates hypotheses, conducts experiments, and identifies biases, improving understanding and safety in AI applications.
Llama 3 Secrets Every Engineer Must Know
Llama 3 is an advanced open-source language model trained on 15 trillion multilingual tokens, featuring 405 billion parameters, improved reasoning, and multilingual capabilities, while exploring practical applications and limitations.
Mistral releases Pixtral 12B, its first multimodal model
French AI startup Mistral launched Pixtral 12B, a multimodal model for processing images and text, featuring 12 billion parameters. The model is available under an Apache 2.0 license.
Pixtral 12B
Mistral AI released Pixtral 12B, its first multimodal model for image and text processing, achieving 52.5% on the MMMU benchmark and supporting variable image sizes in a 128K token context.
Meta Llama 3 vision multimodal models – how to use them and what they can do
Meta's Llama 3 model now supports multimodal inputs, allowing image and text processing. While it excels in image recognition and sentiment analysis, it shows significant limitations in reasoning and visual data interpretation.
Cool, though it maybe needs a better name for SEO. ARIA already has a meaning in web apps.