November 17th, 2024

All-in-one embedding model for interleaved text, images, and screenshots

Voyage AI has released voyage-multimodal-3, an embedding model that embeds interleaved text and images together and improves retrieval accuracy by an average of 19.63% on multimodal tasks. The first 200 million tokens are free.

Voyage AI has announced the release of voyage-multimodal-3, a cutting-edge embedding model designed for processing interleaved text and images, including screenshots from various document types. This model significantly enhances retrieval accuracy, outperforming existing models by an average of 19.63% across three multimodal retrieval tasks evaluated on 20 datasets. Unlike traditional models that treat text and images separately, voyage-multimodal-3 integrates both modalities within the same transformer architecture, allowing for a unified representation that captures the contextual relationship between visual and textual information. This innovation addresses the limitations of existing models, which struggle with complex layouts and mixed-modality searches. In evaluations, voyage-multimodal-3 demonstrated superior performance in table/figure retrieval, document screenshot retrieval, and text-to-photo retrieval, achieving improvements of up to 2.2 times over competitors like OpenAI CLIP and Cohere multimodal v3. The model is now available for use, with the first 200 million tokens offered for free, and aims to simplify the process of vectorizing knowledge bases that include both structured and unstructured data.

- Voyage-multimodal-3 improves retrieval accuracy by an average of 19.63% over previous models.

- The model integrates text and image processing within the same architecture for better contextual understanding.

- It outperforms existing models like OpenAI CLIP and Cohere multimodal v3 by significant margins in various retrieval tasks.

- The model is available for use, with an initial offering of 200 million free tokens.

- Voyage-multimodal-3 simplifies the handling of complex document layouts and mixed-modality searches.
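
For concreteness, here is a minimal sketch of what embedding an interleaved text-and-screenshot document could look like with Voyage's Python client. The multimodal_embed method, its arguments, and the image path are assumptions based on the general pattern of Voyage's client library, not a verified transcript of the current API; check the official docs before relying on it.

  # Hedged sketch: embed one interleaved text + screenshot "document" and a text
  # query, then score them by cosine similarity. Method/argument names are
  # assumptions; the image path is a placeholder.
  import numpy as np
  import voyageai
  from PIL import Image

  vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

  # A single document can interleave text strings and PIL images in one list.
  doc = ["Q3 revenue summary (see chart):", Image.open("q3_revenue_chart.png")]
  query = ["Which quarter had the highest revenue?"]

  doc_emb = vo.multimodal_embed([doc], model="voyage-multimodal-3",
                                input_type="document").embeddings[0]
  query_emb = vo.multimodal_embed([query], model="voyage-multimodal-3",
                                  input_type="query").embeddings[0]

  # Cosine similarity between the query and the mixed-modality document.
  a, b = np.asarray(query_emb), np.asarray(doc_emb)
  print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))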

AI: What people are saying
The release of Voyage AI's multimodal embedding model has generated a variety of responses from the community.
  • Concerns about the limitations of existing multimodal models, particularly regarding the modality gap in search accuracy.
  • Questions about the model's evaluation methods and performance on non-English text.
  • Criticism of the model being API-only and proprietary, limiting accessibility for developers.
  • Suggestions for qualitative analysis alongside quantitative benchmarks to better understand model performance.
  • Discussion on the competitive landscape of multimodal models and the role of funding in the tech industry.
13 comments
By @djoldman - 3 months
This is a key observation that is simple and intuitive:

>All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.
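
A quick way to see what that gap means in practice: with vectors from any CLIP-like model, the problem shows up when an irrelevant text still scores higher against a text query than the relevant screenshot does. The sketch below is illustrative only; the vectors are placeholders you would obtain from whichever model you are testing.

  # Minimal modality-gap check over precomputed embedding vectors.
  import numpy as np

  def cosine(a, b):
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

  def has_modality_gap(query_vec, matching_image_vec, unrelated_text_vecs):
      image_score = cosine(query_vec, matching_image_vec)
      best_text_score = max(cosine(query_vec, t) for t in unrelated_text_vecs)
      # Gap symptom: an irrelevant same-modality item outranks the relevant
      # cross-modality item, skewing results toward text.
      return best_text_score > image_score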

By @jonathan-adly - 3 months
If you are interested in that space, I would throw our project into the mix; it uses ColPali under the hood, transparently.

https://github.com/tjmlabs/ColiVara

The main benchmark for this is the Vidore leaderboard, where we would love to see how VoyageAI performs compared to the more open-source implementations.

By @FergusArgyll - 3 months
I'm missing something. Shouldn't any LLM that's 'natively multimodal' already include embeddings that are multimodal? For example, here's Google's blog post on Gemini:

  Until now, the standard approach to creating multimodal models involved 
  training separate components for different modalities and then stitching them 
  together to roughly mimic some of this functionality. These models can 
  sometimes be good at performing certain tasks, like describing images, but  
  struggle with more conceptual and complex reasoning.

  We designed Gemini to be natively multimodal, pre-trained from the start on 
  different modalities. Then we fine-tuned it with additional multimodal data to 
  further refine its effectiveness. This helps Gemini seamlessly understand and 
  reason about all kinds of inputs from the ground up, far better than existing 
  multimodal models — and its capabilities are state of the art in nearly every 
  domain.

By @carschno - 3 months
This does read as very impressive. Any critical perspectives on the presented evaluation? What about non-English text?

I understand the model is, like for other commercial ones, available exclusively through their API, right?

By @greatgib - 3 months
Indeed, it's sad that their models are both commercial/proprietary and API-only.

By @Zopieux - 3 months
API-only model. No thanks, but congrats anyway.

By @ritabratamaiti - 3 months
Looks quite interesting! I’ve been working on AnyModal, a framework for integrating different data types (like images and audio) with LLMs: https://github.com/ritabratamaiti/AnyModal. It seems that voyage-multimodal-3 would be quite promising in developing multimodal LLMs, but I am not sure if that is the intended use case.
By @unit149 - 3 months
In the traditional Python API, the Voyage engine tokenizes blocks of text and outputs a sequence of tokens. This model seems to extend that by vectorizing images in the same space.

Words like 'you' and 'apple' will map to a single token. More complex terms like 'pikachu' may be divided into subwords such as pik-a-chu.

[1]: https://docs.voyageai.com/docs/tokenization
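
For illustration, here is how subword splitting looks with a generic BPE tokenizer (GPT-2 via Hugging Face, not Voyage's actual tokenizer); the exact splits depend on each tokenizer's learned vocabulary.

  # Common words usually map to one token; rarer strings get split into subwords.
  from transformers import AutoTokenizer

  tok = AutoTokenizer.from_pretrained("gpt2")
  for word in ["you", "apple", "pikachu"]:
      print(word, "->", tok.tokenize(word))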

By @djoldman - 3 months
This is a cool way to look at multimodal embeddings. They look at performance as the percentage of inputs slides from one modality to the other:

https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...

By @mech4lunch - 3 months
The Colab measures dot product values of 0.428 and 0.498, describing them as "...similarity value is quite high." Is that high? Can you design a system that confidently labels data with a 0.4 threshold?
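
Whether 0.4 counts as "high" depends on how matching and non-matching pairs score under the same model, so one option is to calibrate the threshold on labeled pairs rather than fixing it up front. A rough sketch with placeholder scores and labels:

  # Pick the similarity threshold that maximizes F1 over labeled pairs.
  import numpy as np

  def best_threshold(scores, labels):
      scores = np.asarray(scores, dtype=float)
      labels = np.asarray(labels, dtype=bool)
      best_f1, best_t = 0.0, 0.0
      for t in np.unique(scores):
          pred = scores >= t
          tp = np.sum(pred & labels)
          if tp == 0:
              continue
          precision = tp / pred.sum()
          recall = tp / labels.sum()
          f1 = 2 * precision * recall / (precision + recall)
          if f1 > best_f1:
              best_f1, best_t = f1, float(t)
      return best_t, best_f1

  # e.g. best_threshold([0.498, 0.428, 0.31, 0.22], [True, True, False, False])
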
By @skeptrune - 3 months
I wish people would take the time to put in real datasets and do a qualitative analysis of when and why "foo new solution" is better.

Quantitative benchmarks are great, but sparse.

By @joanfihu - 3 months
Check out ColPali and ColQwen for a SOTA open-source version.

By @tinyhouse - 3 months
Funny, all those big-name Stanford advisors for a company that builds embeddings... A couple of strong MLEs can deliver everything they are doing. This shouldn't be a company, but OK... I'm sure some clueless VCs in SV gave them money.

And just to be clear: I don't think that delivering strong embeddings for different domains is an easy task. However, it's 2024, not 2016.