All-in-one embedding model for interleaved text, images, and screenshots
Voyage AI released voyage-multimodal-3, an embedding model that integrates interleaved text and images in a single representation and improves retrieval accuracy by an average of 19.63%; the first 200 million tokens are free.
Voyage AI has announced the release of voyage-multimodal-3, a cutting-edge embedding model designed for processing interleaved text and images, including screenshots from various document types. This model significantly enhances retrieval accuracy, outperforming existing models by an average of 19.63% across three multimodal retrieval tasks evaluated on 20 datasets. Unlike traditional models that treat text and images separately, voyage-multimodal-3 integrates both modalities within the same transformer architecture, allowing for a unified representation that captures the contextual relationship between visual and textual information. This innovation addresses the limitations of existing models, which struggle with complex layouts and mixed-modality searches.
In evaluations, voyage-multimodal-3 demonstrated superior performance in table/figure retrieval, document screenshot retrieval, and text-to-photo retrieval, achieving improvements of up to 2.2 times over competitors like OpenAI CLIP and Cohere multimodal v3. The model is now available for use, with the first 200 million tokens offered for free, and aims to simplify the process of vectorizing knowledge bases that include both structured and unstructured data.
- Voyage-multimodal-3 improves retrieval accuracy by an average of 19.63% over previous models.
- The model integrates text and image processing within the same architecture for better contextual understanding.
- It outperforms existing models like OpenAI CLIP and Cohere multimodal v3 by significant margins in various retrieval tasks.
- The model is available for use, with an initial offering of 200 million free tokens (a minimal usage sketch follows this list).
- Voyage-multimodal-3 simplifies the handling of complex document layouts and mixed-modality searches.
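For orientation, here is a minimal sketch of what vectorizing an interleaved document and a query might look like with the voyageai Python client. The multimodal_embed method name, its inputs format, and the input_type values reflect my reading of Voyage's public documentation and should be treated as assumptions, not a verified interface; the file path and strings are placeholders.

```python
# Sketch only: embedding an interleaved text + screenshot document and a text
# query with voyage-multimodal-3. The exact client signature may differ.
import voyageai
from PIL import Image

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

# One document = one list of interleaved pieces (strings and PIL images).
page = Image.open("report_page_3.png")  # placeholder screenshot path
document = ["Quarterly revenue summary:", page, "Figures are in USD millions."]

doc_result = vo.multimodal_embed(
    inputs=[document],
    model="voyage-multimodal-3",
    input_type="document",
)

query_result = vo.multimodal_embed(
    inputs=[["What was Q3 revenue?"]],
    model="voyage-multimodal-3",
    input_type="query",
)

# Each call returns one vector per input; retrieval is then plain cosine similarity.
doc_vec, query_vec = doc_result.embeddings[0], query_result.embeddings[0]
```

Because both the text and the screenshot land in the same vector space, the knowledge base does not need a separate OCR or captioning step before indexing, which is the simplification the announcement emphasizes.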
Related
Integrating Vision into RAG Applications
Retrieval Augmented Generation (RAG) now supports multimodal capabilities on Azure, enabling large language models to process text and images, enhancing query responses and improving utility in visual data fields.
Pixtral 12B
Mistral AI released Pixtral 12B, its first multimodal model for image and text processing, achieving 52.5% on the MMMU benchmark and supporting variable image sizes in a 128K token context.
Nvidia releases NVLM 1.0 72B open weight model
NVIDIA launched NVLM 1.0, featuring the open-sourced NVLM-D-72B model, which excels in multimodal tasks, outperforms competitors like GPT-4o, and supports multi-GPU loading for text and image interactions.
Meta Llama 3 vision multimodal models – how to use them and what they can do
Meta's Llama 3 model now supports multimodal inputs, allowing image and text processing. While it excels in image recognition and sentiment analysis, it shows significant limitations in reasoning and visual data interpretation.
ARIA: An Open Multimodal Native Mixture-of-Experts Model
Aria is an open-source multimodal AI model with 3.9 billion visual and 3.5 billion text parameters, outperforming proprietary models and enhancing capabilities through a four-stage pre-training pipeline.
- Concerns about the limitations of existing multimodal models, particularly regarding the modality gap in search accuracy.
- Questions about the model's evaluation methods and performance on non-English text.
- Criticism of the model being API-only and proprietary, limiting accessibility for developers.
- Suggestions for qualitative analysis alongside quantitative benchmarks to better understand model performance.
- Discussion on the competitive landscape of multimodal models and the role of funding in the tech industry.
>All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.
https://github.com/tjmlabs/ColiVara
The main benchmark for this is the Vidore leaderboard, where we would love to see how VoyageAI performs compared to the open-source implementations.
Until now, the standard approach to creating multimodal models involved training separate components for different modalities and then stitching them together to roughly mimic some of this functionality. These models can sometimes be good at performing certain tasks, like describing images, but struggle with more conceptual and complex reasoning.
We designed Gemini to be natively multimodal, pre-trained from the start on different modalities. Then we fine-tuned it with additional multimodal data to further refine its effectiveness. This helps Gemini seamlessly understand and reason about all kinds of inputs from the ground up, far better than existing multimodal models — and its capabilities are state of the art in nearly every domain.
I understand the model is, like other commercial ones, available exclusively through their API, right?
Words like 'you' and 'apple' will each be a single token. More complex terms like 'pikachu' may be divided into pieces such as pik-a-chu.
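For a concrete, if approximate, illustration, here is how a common BPE tokenizer (OpenAI's tiktoken with the cl100k_base encoding) splits those strings. Voyage's own tokenizer is not shown here, so the exact splits are tokenizer-specific and only illustrate the general behavior.

```python
# Illustrative only: shows how a typical BPE tokenizer splits words into subwords.
# Frequent words tend to be single tokens; rarer words break into pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["you", "apple", "pikachu"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{word!r} -> {len(ids)} token(s): {pieces}")
```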
https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...
Quantitative benchmarks are great, but sparse.
And just to be clear: I don't think delivering strong embeddings for different domains is an easy task. However, it's 2024, not 2016.