October 2nd, 2024

Nvidia releases NVLM 1.0 72B open weight model

NVIDIA has launched NVLM 1.0, featuring the open-sourced NVLM-D-72B model, which excels in vision-language tasks, rivals leading models such as GPT-4o on multimodal benchmarks, and supports multi-GPU loading for both text and image interactions.


NVIDIA has introduced NVLM 1.0, a series of advanced multimodal large language models (LLMs) that compete with top proprietary and open-access models on vision-language tasks. NVLM-D-72B, the decoder-only variant, is now open-sourced for community use, and it notably shows enhanced performance on text-only tasks after multimodal training. Benchmark results indicate that NVLM-D-72B achieves competitive scores across multimodal benchmarks such as MMMU, MathVista, and VQAv2, outperforming several existing models, including GPT-4o and Llama 3, on a number of them. The model has been adapted for Hugging Face Transformers to ensure reproducibility and ease of inference: users can load it across multiple GPUs and query it conversationally with text alone or with text and images. The open-source release includes model weights, code, and detailed instructions for training and inference, promoting accessibility and collaboration within the AI community.

- NVIDIA has launched NVLM 1.0, a series of multimodal LLMs.

- NVLM-D-72B is open-sourced and shows improved performance in text-only tasks.

- The model competes with leading models like GPT-4o and Llama 3 in multimodal benchmarks.

- It supports multi-GPU loading and is designed for both text and image interactions (see the sketch after this list).

- Comprehensive resources for training and inference are provided for community use.
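For readers who want to try it, a minimal sketch of multi-GPU loading and a text-only query is shown below. It assumes the `nvidia/NVLM-D-72B` Hugging Face repository's custom modeling code, which exposes an InternVL-style `chat` helper; exact helper names and signatures may differ, so check the model card before running.

```python
# Minimal sketch: shard NVLM-D-72B across all visible GPUs and run a text query.
# Assumes the repo's custom modeling code provides an InternVL-style `chat`
# helper; verify the exact interface on the Hugging Face model card.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nvidia/NVLM-D-72B"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # half-precision weights (~2 bytes/param)
    low_cpu_mem_usage=True,
    trust_remote_code=True,       # the repo ships its own modeling code
    device_map="auto",            # let Accelerate split layers across GPUs
).eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Text-only conversation; for image queries the model card shows passing
# preprocessed pixel values in place of `None`.
generation_config = dict(max_new_tokens=256, do_sample=False)
response, history = model.chat(
    tokenizer, None, "Hello, who are you?", generation_config,
    history=None, return_history=True,
)
print(response)
```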

7 comments
By @imjonse - about 2 months
It is a family of multimodal models based on the pretrained Qwen2-72B-Instruct LLM and the InternViT vision encoder. There are three variants, differentiated by how the vision tokens are used: decoder-only (like the majority of existing VLMs), cross-attention-based, and a hybrid. Only the first seems to be on Hugging Face at the moment.

Also, they seem to train only on publicly available data, concluding that quality is more important than scale.

By @keyboardsamurai - about 2 months
It has a non-commercial CC BY-NC 4.0 license. I would guess the only way to use this in production is to have Nvidia's data centers host it? Or are there other ways?
By @rd42 - about 2 months
I think the most relevant part to note here is that this model showed improved text-only performance after multimodal training. Wonder if this translates to Llama models as well? Is it possible to extend Llama 3.1 405B with multimodal training to create another SOTA large model?
By @optimalsolver - about 2 months
Reminder that Nvidia is still the only company making any money out of the "AI revolution".
By @jftuga - about 2 months
How much GPU RAM would be needed to run this with just one GPU?
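As a rough back-of-envelope estimate (weights only, ignoring the vision encoder, KV cache, and activations), 72B parameters at about 2 bytes each in bf16/fp16 already exceed any single consumer GPU:

```python
# Rough VRAM estimate for the 72B-parameter weights alone
# (excludes the vision encoder, KV cache, and activation memory).
params = 72e9
for precision, bytes_per_param in [("bf16/fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB")
# bf16/fp16: ~144 GB, int8: ~72 GB, int4: ~36 GB -- hence the multi-GPU
# loading instructions for running it at half precision.
```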
By @cjtrowbridge - about 2 months
I love how they include a helpful chart that shows this model scores worse than everything else.