September 17th, 2024

Pixtral 12B

Mistral AI released Pixtral 12B, its first multimodal model for image and text processing, achieving 52.5% on the MMMU benchmark and supporting variable image sizes in a 128K token context.

Read original article

Mistral AI has announced the release of Pixtral 12B, its first multimodal model designed to process both images and text. This model features a new 400M parameter vision encoder and a 12B parameter multimodal decoder, allowing it to handle variable image sizes and multiple images within a long context window of 128K tokens. Pixtral 12B excels in multimodal tasks such as document question answering and chart understanding, achieving a score of 52.5% on the MMMU reasoning benchmark, outperforming several larger models. It maintains strong performance on text-only benchmarks, making it a versatile tool for developers. The model is designed to be a drop-in replacement for Mistral Nemo 12B, providing best-in-class multimodal reasoning without sacrificing text capabilities. Pixtral's architecture allows for efficient processing of images at their native resolution, enhancing its ability to understand complex diagrams and documents. The model has been benchmarked against both open and closed models, demonstrating superior performance in instruction following tasks. Pixtral is available for use via La Plateforme and Le Chat, with open-source prompts and evaluation benchmarks to be shared with the community.
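One practical consequence of native-resolution encoding is that an image's token cost scales with its size rather than being fixed. The sketch below illustrates that scaling; the 16-pixel patch size and 1024-pixel per-side cap are illustrative assumptions, not figures from the announcement.

```python
# Back-of-the-envelope estimate of how many tokens a single image might
# consume when encoded at native resolution. The 16x16 patch size and the
# 1024-pixel per-side cap are assumptions made for illustration only.
import math

def estimate_image_tokens(width: int, height: int,
                          patch: int = 16, max_side: int = 1024) -> int:
    """Rough patch count for one image under the assumed encoder settings."""
    # Downscale proportionally if either side exceeds the assumed cap.
    scale = min(1.0, max_side / max(width, height))
    w, h = int(width * scale), int(height * scale)
    # One token per patch; a real encoder may add separator/break tokens.
    return math.ceil(w / patch) * math.ceil(h / patch)

for w, h in [(1024, 1024), (640, 480)]:
    print(f"{w}x{h} image -> ~{estimate_image_tokens(w, h)} tokens")
# ~4096 and ~1200 tokens respectively, leaving most of the 128K-token
# context free for text and additional images.
```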

- Pixtral 12B is Mistral AI's first multimodal model, integrating image and text processing.

- It achieves a 52.5% score on the MMMU reasoning benchmark, outperforming larger models.

- The model supports variable image sizes and can process multiple images in a 128K token context.

- Pixtral excels in both multimodal and text-only instruction following tasks.

- It is available for use through La Plateforme and Le Chat, with open-source resources planned; a hedged example of calling it via the API follows below.
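For developers who want to try the model programmatically, here is a minimal sketch of a call to La Plateforme. The chat-completions endpoint, the `pixtral-12b-2409` model identifier, and the shape of the `image_url` content part are assumptions not confirmed by the announcement, so check Mistral's API documentation for the exact schema and limits (including the maximum image size).

```python
# Minimal sketch of querying Pixtral 12B on La Plateforme.
# Assumptions (not stated in the announcement): the chat-completions URL,
# the "pixtral-12b-2409" model id, and the image_url content-part format.
import os
import requests

API_URL = "https://api.mistral.ai/v1/chat/completions"
API_KEY = os.environ["MISTRAL_API_KEY"]  # key issued by La Plateforme

payload = {
    "model": "pixtral-12b-2409",  # assumed identifier for Pixtral 12B
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the chart in this image."},
                {"type": "image_url", "image_url": "https://example.com/chart.png"},
            ],
        }
    ],
    "max_tokens": 512,
}

resp = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

Additional `image_url` parts can in principle be interleaved with text in the same message, since the model accepts multiple images within its 128K token context.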

Related

Mistral NeMo

Mistral AI introduces Mistral NeMo, a powerful 12B model developed with NVIDIA. It features a large context window, strong reasoning abilities, and FP8 inference support. Available under Apache 2.0 license for diverse applications.

Mathstral: 7B LLM designed for math reasoning and scientific discovery

MathΣtral, a new 7B model by Mistral AI, focuses on math reasoning and scientific discovery, inspired by Archimedes and Newton. It excels in STEM with high reasoning abilities, scoring 56.6% on MATH and 63.47% on MMLU. The model's release under Apache 2.0 license supports academic projects, showcasing performance/speed tradeoffs in specialized models. Further enhancements can be achieved through increased inference-time computation. Professor Paul Bourdon's curation of GRE Math Subject Test problems contributed to the model's evaluation. Instructions for model use and fine-tuning are available in the documentation hosted on HuggingFace.

Large Enough – Mistral AI

Mistral AI released Mistral Large 2, enhancing code generation, reasoning, and multilingual support with 123 billion parameters. It outperforms competitors and is available for research use via various cloud platforms.

Mistral Agents

Mistral AI has improved model customization for its flagship models, introduced "Agents" for custom workflows, and released a stable SDK version, enhancing accessibility and effectiveness of generative AI for developers.

Mistral releases Pixtral 12B, its first multimodal model

French AI startup Mistral launched Pixtral 12B, a multimodal model for processing images and text, featuring 12 billion parameters. The model is available under an Apache 2.0 license.

7 comments
By @mkaic - 5 months
As much as I love their work, I can't be the only one who really struggles to see a path to profitability for Mistral, right? How do you make money selling API access to a model which anyone else can spin up an API for (license is Apache 2.0) on AWS or GCP or similar? Do they have some sort of magic inference optimization that allows them to be cheaper per-token than other hosting providers? Why would I use their API instead of anybody else's?

Asking these questions as a genuine fan of this company—I really want to believe they can succeed and not go the way of StabilityAI.

By @devinprater - 5 months
Seems okay at image descriptions, I suppose. Still a 12B model, though, and it doesn't always get OCR anywhere near correct. I tried it on Le Chat, and I'm waiting for it to show up on Ollama.
By @davedx - 5 months
Anyone from Mistral here? The link to the docs is broken, and I'd like to know more about the specifications for calling this via the API. Most importantly, what's the maximum image size you can use via the API? Thank you!
By @adt - 5 months
Nemo 12B MMLU=68.0

Pixtral 12B MMLU=69.2

Looking at images made it smarter...

https://lifearchitect.ai/models-table/

By @etaioinshrdlu - 5 months
It would be interesting to add a decoder for image outputs, similar to GPT-4o (that feature hasn't been talked about much, or released...).
By @adzm - 5 months
Very impressive results