NVLM 1.0: NVIDIA's new open-source model
NVIDIA's NVLM 1.0 introduces multimodal large language models excelling in vision-language tasks, with the 72B version showing improved text performance, novel architecture, and open-sourced resources for community benefit.
NVIDIA has introduced NVLM 1.0, a family of multimodal large language models (LLMs) that achieve state-of-the-art performance on vision-language tasks, competing with both proprietary models like GPT-4o and open-access models such as Llama 3-V. The NVLM 1.0 model, particularly the 72B version, shows significant improvements on text-only tasks after multimodal training, with an average accuracy increase of 4.3 points. It outperforms or matches leading models across various benchmarks, including MathVista and OCRBench, demonstrates strong instruction-following capabilities, and excels at tasks requiring reasoning, localization, and coding. Notably, NVLM 1.0 employs a novel architecture that improves training efficiency and multimodal reasoning, alongside a 1-D tile-tagging design for handling high-resolution images. The training data is meticulously curated, emphasizing dataset quality and task diversity over sheer scale. The model weights and training code are being open-sourced to benefit the broader community.
- NVLM 1.0 achieves state-of-the-art results in vision-language tasks.
- The 72B model shows improved performance in text-only tasks post-multimodal training.
- It outperforms or matches leading models on key benchmarks.
- The model features a novel architecture for enhanced multimodal reasoning.
- Open-sourcing of model weights and training code is intended for community use.
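The 1-D tile-tagging design mentioned above can be illustrated with a minimal sketch. The idea, per the NVLM report, is that a high-resolution image is split into tiles, and a text tag is inserted before each tile's visual tokens so the language model can tell tiles apart in the flattened 1-D sequence. The tag names and function below are illustrative assumptions, not NVIDIA's actual implementation:

```python
# Hypothetical sketch of a 1-D tile-tagging scheme: a global thumbnail
# plus high-resolution tiles are flattened into one token sequence,
# with a text tag (e.g. <tile_1>, <tile_2>, ...) prefixing each tile
# so positional identity survives the flattening.

def tag_tiles(tile_tokens: list[list[str]], thumbnail_tokens: list[str]) -> list[str]:
    """Flatten per-tile visual-token lists into one tagged 1-D sequence."""
    sequence: list[str] = []
    # Global low-resolution thumbnail first, with its own tag.
    sequence.append("<tile_global_thumbnail>")
    sequence.extend(thumbnail_tokens)
    # Each high-resolution tile is prefixed by a 1-D positional tag.
    for i, tokens in enumerate(tile_tokens, start=1):
        sequence.append(f"<tile_{i}>")
        sequence.extend(tokens)
    return sequence

tiles = [["v1", "v2"], ["v3", "v4"]]
thumb = ["t1"]
print(tag_tiles(tiles, thumb))
# ['<tile_global_thumbnail>', 't1', '<tile_1>', 'v1', 'v2', '<tile_2>', 'v3', 'v4']
```

The tags cost only a handful of extra tokens per image while giving the decoder an explicit cue about which tile each run of visual tokens came from.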
Related
Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU
The article discusses the release of open-source Llama3 70B model, highlighting its performance compared to GPT-4 and Claude3 Opus. It emphasizes training enhancements, data quality, and the competition between open and closed-source models.
Llama 3 Secrets Every Engineer Must Know
Llama 3 is an advanced open-source language model trained on 15 trillion multilingual tokens, featuring 405 billion parameters, improved reasoning, and multilingual capabilities, while exploring practical applications and limitations.
Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
Meta released Llama 3.2, featuring vision models with 11B and 90B parameters, and lightweight text models with 1B and 3B parameters, optimized for edge devices and supporting extensive deployment options.
Llama can now see and run on your device – welcome Llama 3.2
Meta has released Llama 3.2 with multimodal capabilities, smaller models for on-device use, and licensing restrictions for EU users. It supports multiple languages and integrates with Hugging Face Transformers.