July 18th, 2024

Mistral NeMo

Mistral AI introduces Mistral NeMo, a powerful 12B model developed with NVIDIA. It features a large context window, strong reasoning abilities, and FP8 inference support. Available under Apache 2.0 license for diverse applications.

Read original articleLink Icon
Mistral NeMo

Mistral AI has introduced Mistral NeMo, a cutting-edge 12B model developed in collaboration with NVIDIA. This model boasts a 128k token context window, exceptional reasoning abilities, and coding accuracy within its size category. Released under the Apache 2.0 license, Mistral NeMo offers pre-trained base and instruction-tuned checkpoints to facilitate adoption by researchers and enterprises. Notably, it supports FP8 inference without compromising performance due to its training with quantization awareness. Designed for global applications, Mistral NeMo excels in various languages and utilizes the Tekken tokenizer, which outperforms previous models in compressing natural language text and source code. The model is available for use through mistral-inference and mistral-finetune, hosted on HuggingFace, and accessible as a NVIDIA NIM inference microservice. Mistral NeMo represents a significant advancement in making frontier AI models accessible across multiple languages and applications.

Related

NuExtract: A LLM for Structured Extraction

NuExtract: A LLM for Structured Extraction

NuExtract is a structure extraction model by NuMind, offering tiny and large versions. NuMind also provides NuNER Zero and sentiment analysis models. Mistral 7B, by Mistral AI, excels in benchmarks with innovative attention mechanisms.

My finetuned models beat OpenAI's GPT-4

My finetuned models beat OpenAI's GPT-4

Alex Strick van Linschoten discusses his finetuned models Mistral, Llama3, and Solar LLMs outperforming OpenAI's GPT-4 in accuracy. He emphasizes challenges in evaluation, model complexities, and tailored prompts' importance.

Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI

Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI

Selecting the right inference backend for large language models is crucial for user experience and cost efficiency. A benchmark study by BentoML compared various backends, highlighting LMDeploy's decoding performance, vLLM's low TTFT, and considerations beyond performance. BentoML and BentoCloud are recommended tools for efficient AI model deployment.

Meta AI develops compact language model for mobile devices

Meta AI develops compact language model for mobile devices

Meta AI introduces MobileLLM, a compact language model challenging the need for large AI models. Optimized with under 1 billion parameters, it outperforms larger models by 2.7% to 4.3% on tasks. MobileLLM's innovations include model depth prioritization, embedding sharing, grouped-query attention, and weight-sharing techniques. The 350 million parameter version matches larger models' accuracy on specific tasks, hinting at compact models' potential for efficiency. While not publicly available, Meta has open-sourced the pre-training code, promoting research towards sustainable AI models for personal devices.

Codestral Mamba

Codestral Mamba

Codestral Mamba, a new Mamba2 language model by Mistral AI, excels in code generation with linear time inference and infinite sequence modeling. It rivals transformer models, supports 256k tokens, and aids local code assistance. Deployable via mistral-inference SDK or TensorRT-LLM, it's open-source under Apache 2.0.

Link Icon 30 comments
By @yjftsjthsd-h - 6 months
> Today, we are excited to release Mistral NeMo, a 12B model built in collaboration with NVIDIA. Mistral NeMo offers a large context window of up to 128k tokens. Its reasoning, world knowledge, and coding accuracy are state-of-the-art in its size category. As it relies on standard architecture, Mistral NeMo is easy to use and a drop-in replacement in any system using Mistral 7B.

> We have released pre-trained base and instruction-tuned checkpoints checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.

So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I login and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.

By @minimaxir - 6 months
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.

Does anyone have a good answer why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.

By @alecco - 6 months
Nvidia has a blogpost about Mistral Nemo, too. https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/

> Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.

> *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.

> The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.

By @dpflan - 6 months
These big models are getting pumped out like crazy, that is the business of these companies. But basically, it feels like private/industry just figured out how to scale up a scalable process (deep learning), and it required not $M research grants but $BB "research grants"/funding, and the scaling laws seem to be fun to play with and tweak more interesting things out of these and find cool "emergent" behavior as billions of data points get correlated.

But pumping out models and putting artifacts on HuggingFace, is that a business? What are these models being used for? There is a new one at a decent clip.

By @mcemilg - 6 months
I believe that if Mistral is serious about advancing in open source, they should consider sharing the corpus used for training their models, at least the base models pretraining data.
By @jorgesborges - 6 months
I’m AI stupid. Does anyone know if training on multiple languages provides “cross-over” — so training done in German can be utilized when answering a prompt in English? I once went through various Wikipedia articles in a couple languages and the differences were interesting. For some reason I thought they’d be almost verbatim (forgetting that’s not how Wikipedia works!) and while I can’t remember exactly I felt they were sometimes starkly different in tone and content.
By @eigenvalue - 6 months
I have to say, the experience of trying to sign up for Nvidia Enterprise so you can try the "NIM" packaged version of this model, is just icky and and awful now that I've gotten used to actually free and open models and software. It feels much nicer and more free to be able to clone llama.cpp and wget a .gguf model file from huggingface without any registration at all. Especially since it has now been several hours since I signed up for the Nvidia account and it still says on the website "Your License Should be Active Momentarily | We're setting up your credentials to download NIMs."

I really don't get Nvidia's thinking with this. They basically have a hardware monopoly. I shelled out the $4,000 or so to buy two of their 4090 GPUs. Why are they still insisting on torturing me with jumping through these awful hoops? They should just be glad that they're winning and embrace freedom.

By @andrethegiant - 6 months
I still don’t understand the business model of releasing open source gen AI models. If this took 3072 H100s to train, why are they releasing it for free? I understand they charge people when renting from their platform, but why permit people to run it themselves?
By @pixelatedindex - 6 months
Pardon me if this is a dumb question, but is it possible for me to download these models into my computer (I have a 1080ti and a [2|3]070ti) and generate some sort of api interface? That way I can write programs that calls this API, and I find this appealing.

EDIT: This a 1W light bulb moment for me, thank you!

By @simonw - 6 months
I wonder why Mistral et al don't prepare GGUF versions of these for launch day?

If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.

By @bugglebeetle - 6 months
Interested in the new base model for fine tuning. Despite Llama3 being a better instruct model overall, it’s been highly resistant to fine-tuning, either owing to some bugs or being trained on so much data (ongoing debate about this in the community). Mistral’s base model are still best in class for small model you can specialize.
By @madeofpalk - 6 months
I find it interesting how coding/software development still appears to be the one category that these most popular models release specialised models for. Where's the finance or legal models from Mistral or Meta or OpenAI?

Perhaps it's just confirmation bias, but programming really does seem to be the ideal usecase for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively more straight forward to tell if code is "correct" or not.

By @adt - 6 months
That's 3 releases for Mistral in 24 hours.

https://lifearchitect.ai/models-table/

By @pants2 - 6 months
Exciting, I think 12B is the sweet spot for running locally - large enough to be useful, fast enough to run on a decent laptop.
By @zone411 - 6 months
Interesting that the benchmarks they show have it outperforming Gemma 2 9B and Llama 3 8B, but it does a lot worse on my NYT Connections benchmark (5.1 vs 16.3 and 12.3). The new GPT-4o mini also does better at 14.3. It's just one benchmark though, so looking forward to additional scores.
By @Workaccount2 - 6 months
Is "Parameter Creep" going to becomes a thing? They hold up Llama-8b as a competitor despite NeMo having 50% more parameters.

The same thing happened with gemma-27b, where they compared it to all the 7-9b models.

It seems like an easy way to boost benchmarks while coming off as "small" at first glance.

By @PoignardAzur - 6 months
> Mistral NeMo uses a new tokenizer, Tekken, based on Tiktoken, that was trained on over more than 100 languages, and compresses natural language text and source code more efficiently than the SentencePiece tokenizer used in previous Mistral models.

From Mistral's page about Tekken:

> Our newest tokenizer, tekken, uses the Byte-Pair Encoding (BPE) with Tiktoken.

Does that mean that Mistral found that BPE is more efficient than unigram models?

Because otherwise, I don't understand why AI companies keep using BPE for their token sets. Unigram methods leads to more legible tokens, fewer glitch tokens, fewer super-long outlier tokens, etc.

By @danielhanchen - 6 months
I just managed to make Mistral NeMo 4bit QLoRA finetuning fit in under 12GB, so it fits in a free Google Colab with a Tesla T4 GPU! VRAM is shaved by 60% and finetuning is also 2x faster! Colab: https://colab.research.google.com/github/unslothai/studio/bl...
By @wkcheng - 6 months
Does anyone know whether the 128K is input tokens only? There are a lot of models that have a large context window for input but a small output context. If this actually has 128k tokens shared between input and output, that would be a game changer.
By @hislaziness - 6 months
I just checked huggingface and the model files download is about 25GB but in a comment below someone mentioned it is 8fp quantized model. Trying to understand how the quantization affects the model (and RAM) size. Can someone please enlighten.
By @ofermend - 6 months
Congrats. Very exciting to see continued innovation around smaller models, that can perform much better than larger models. This enables faster inference and makes them more ubiquitous.
By @obblekk - 6 months
Worth noting this model has 50% more parameters than llama3. There are performance gains but some of the gains might be from using more compute rather than performance per unit compute.
By @davidzweig - 6 months
Did anyone try to check how are it's multilingual skills vs. Gemma 2? On the page, it's compared with LLama 3 only.
By @p1esk - 6 months
Interesting how it will compete with 4o mini.
By @lostmsu - 6 months
Gonna wait for LMSYS benchmarks. The "standard" benchmarks all seem unreliable.
By @saberience - 6 months
Two questions:

1) Anyone have any idea of VRAM requirements?

2) When will this be available on ollama?

By @I_am_tiberius - 6 months
The last time I tried a Mistral model, it didn't answer most of my questions, because of "policy" reasons. I hope they fixed that. OpenAI at least only tells me that it's a policy issue but still answers most of the time.
By @k__ - 6 months
What's the reason for measuring the model size in context window length and not GB?

Also, are these small models OSS? Easier self hosting seems to be the main benefo for small models.

By @pantulis - 6 months
Does it have any relation to Nvidia's Nemo? Otherwise, it's unfortunate naming
By @LoganDark - 6 months
Is the base model unaligned? Disappointing to see alignment from allegedly "open" models.