October 2nd, 2024

Nvidia releases NVLM 1.0 72B open weight model

NVIDIA has launched NVLM 1.0, featuring the open-sourced NVLM-D-72B model, which excels in vision-language tasks, rivals leading models such as GPT-4o on multimodal benchmarks, and supports multi-GPU loading for both text and image interactions.


NVIDIA has introduced NVLM 1.0, a series of advanced multimodal large language models (LLMs) that compete with top proprietary and open-access models on vision-language tasks. NVLM-D-72B, the decoder-only variant, is now open-sourced for community use, and it notably shows enhanced performance on text-only tasks after multimodal training. Benchmark results indicate that NVLM-D-72B achieves competitive scores across multimodal benchmarks such as MMMU, MathVista, and VQAv2, outperforming several existing models, including GPT-4o and Llama 3, on a number of them. The model has been adapted for Hugging Face Transformers to ensure reproducibility and ease of inference: users can load it across multiple GPUs and query it conversationally with text alone or with text and images. The open-source release includes model weights, code, and detailed instructions for training and inference, promoting accessibility and collaboration within the AI community.

- NVIDIA has launched NVLM 1.0, a series of multimodal LLMs.

- NVLM-D-72B is open-sourced and shows improved performance in text-only tasks.

- The model competes with leading models like GPT-4o and Llama 3 in multimodal benchmarks.

- It supports multi-GPU loading and is designed for both text and image interactions (see the sketch after this list).

- Comprehensive resources for training and inference are provided for community use.
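For readers who want to try it, a minimal sketch of multi-GPU loading and a text-only query is shown below. It assumes the `nvidia/NVLM-D-72B` Hugging Face repository's custom modeling code, which exposes an InternVL-style `chat` helper; exact helper names and signatures may differ, so check the model card before running.

```python
# Minimal sketch: shard NVLM-D-72B across all visible GPUs and run a text query.
# Assumes the repo's custom modeling code provides an InternVL-style `chat`
# helper; verify the exact interface on the Hugging Face model card.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "nvidia/NVLM-D-72B"

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # half-precision weights (~2 bytes/param)
    low_cpu_mem_usage=True,
    trust_remote_code=True,       # the repo ships its own modeling code
    device_map="auto",            # let Accelerate split layers across GPUs
).eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Text-only conversation; for image queries the model card shows passing
# preprocessed pixel values in place of `None`.
generation_config = dict(max_new_tokens=256, do_sample=False)
response, history = model.chat(
    tokenizer, None, "Hello, who are you?", generation_config,
    history=None, return_history=True,
)
print(response)
```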

7 comments
By @imjonse - about 2 months
It is a family of multimodal models based on the pretrained Qwen2-72B-Instruct LLM and the InternViT vision encoder. There are three variants, differentiated by how the vision tokens are used: decoder-only (like the majority of existing VLMs), cross-attention-based, and a hybrid. Only the first seems to be on Hugging Face at the moment.

Also, they seem to train only on publicly available data, concluding that quality is more important than scale.

By @keyboardsamurai - about 2 months
It has a non-commercial CC BY-NC 4.0 license. I would guess the only way to use this in production is to have Nvidia's data centers host it? Or are there other ways?
By @rd42 - about 2 months
I think the most relevant part to note here is that this model showed improved text-only performance after multimodal training. Wonder if this translates to Llama models as well? Is it possible to extend Llama 3.1 405B with multimodal training to create another SOTA large model?
By @optimalsolver - about 2 months
Reminder that Nvidia is still the only company making any money out of the "AI revolution".
By @jftuga - about 2 months
How much GPU RAM would be needed to run this with just one GPU?
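As a rough back-of-envelope estimate (weights only, ignoring the vision encoder, KV cache, and activations), 72B parameters at about 2 bytes each in bf16/fp16 already exceed any single consumer GPU:

```python
# Rough VRAM estimate for the 72B-parameter weights alone
# (excludes the vision encoder, KV cache, and activation memory).
params = 72e9
for precision, bytes_per_param in [("bf16/fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB")
# bf16/fp16: ~144 GB, int8: ~72 GB, int4: ~36 GB -- hence the multi-GPU
# loading instructions for running it at half precision.
```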
By @cjtrowbridge - about 2 months
I love how they include a helpful chart that shows this model scores worse than everything else.