Mistral NeMo
Mistral AI introduces Mistral NeMo, a powerful 12B model developed with NVIDIA. It features a large context window, strong reasoning abilities, and FP8 inference support. Available under Apache 2.0 license for diverse applications.
Mistral AI has introduced Mistral NeMo, a cutting-edge 12B model developed in collaboration with NVIDIA. The model boasts a 128k-token context window and, within its size category, exceptional reasoning ability and coding accuracy. Released under the Apache 2.0 license, Mistral NeMo ships pre-trained base and instruction-tuned checkpoints to ease adoption by researchers and enterprises. Notably, because it was trained with quantization awareness, it supports FP8 inference without performance loss. Designed for global applications, Mistral NeMo performs strongly across many languages and uses the Tekken tokenizer, which compresses natural-language text and source code more efficiently than previous tokenizers. The model can be used through mistral-inference and mistral-finetune, is hosted on HuggingFace, and is available as an NVIDIA NIM inference microservice. Mistral NeMo represents a significant step toward making frontier AI models accessible across languages and applications.
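Since the weights are on HuggingFace, getting a first response out of the instruct checkpoint takes only a few lines with the transformers library. A minimal sketch (the repo is gated behind a contact-info agreement, as a commenter notes below, so this assumes you have accepted the terms and authenticated with huggingface-cli login):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-Nemo-Instruct-2407"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # place weights on available GPU(s)
)

messages = [{"role": "user", "content": "Summarize Mistral NeMo in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```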
Related
NuExtract: A LLM for Structured Extraction
NuExtract is a structure extraction model by NuMind, offering tiny and large versions. NuMind also provides NuNER Zero and sentiment analysis models. Mistral 7B, by Mistral AI, excels in benchmarks with innovative attention mechanisms.
My finetuned models beat OpenAI's GPT-4
Alex Strick van Linschoten discusses his finetuned models Mistral, Llama3, and Solar LLMs outperforming OpenAI's GPT-4 in accuracy. He emphasizes challenges in evaluation, model complexities, and tailored prompts' importance.
Benchmarking LLM Inference Back Ends: VLLM, LMDeploy, MLC-LLM, TensorRT-LLM, TGI
Selecting the right inference backend for large language models is crucial for user experience and cost efficiency. A benchmark study by BentoML compared various backends, highlighting LMDeploy's decoding performance, vLLM's low TTFT, and considerations beyond performance. BentoML and BentoCloud are recommended tools for efficient AI model deployment.
Meta AI develops compact language model for mobile devices
Meta AI introduces MobileLLM, a compact language model challenging the need for large AI models. Optimized with under 1 billion parameters, it outperforms larger models by 2.7% to 4.3% on tasks. MobileLLM's innovations include model depth prioritization, embedding sharing, grouped-query attention, and weight-sharing techniques. The 350 million parameter version matches larger models' accuracy on specific tasks, hinting at compact models' potential for efficiency. While not publicly available, Meta has open-sourced the pre-training code, promoting research towards sustainable AI models for personal devices.
Codestral Mamba
Codestral Mamba, a new Mamba2 language model by Mistral AI, excels in code generation with linear time inference and infinite sequence modeling. It rivals transformer models, supports 256k tokens, and aids local code assistance. Deployable via mistral-inference SDK or TensorRT-LLM, it's open-source under Apache 2.0.
> We have released pre-trained base and instruction-tuned checkpoints under the Apache 2.0 license to promote adoption for researchers and enterprises. Mistral NeMo was trained with quantisation awareness, enabling FP8 inference without any performance loss.
So that's... uniformly an improvement at just about everything, right? Large context, permissive license, should have good perf. The one thing I can't tell is how big 12B is going to be (read: how much VRAM/RAM is this thing going to need). Annoyingly and rather confusingly for a model under Apache 2.0, https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407 refuses to show me files unless I log in and "You need to agree to share your contact information to access this model"... though if it's actually as good as it looks, I give it hours before it's reposted without that restriction, which Apache 2.0 allows.
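On the VRAM question, the weights alone give a usable lower bound: parameter count times bytes per parameter, before KV cache and runtime overhead. A rough back-of-envelope sketch, not a measured figure:

```python
def weight_memory_gib(params_billion: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GiB; excludes KV cache and overhead."""
    return params_billion * 1e9 * bytes_per_param / 2**30

for label, bpp in [("FP16", 2.0), ("FP8", 1.0), ("4-bit (approx)", 0.5)]:
    print(f"{label}: ~{weight_memory_gib(12, bpp):.1f} GiB")
# FP16: ~22.4 GiB, FP8: ~11.2 GiB, 4-bit: ~5.6 GiB -- and a 128k-token
# KV cache can add several more GiB, depending on batch size and context used.
```

That lines up with NVIDIA's claim further down that it fits on a single L40S or RTX 4090.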
Does anyone have a good answer as to why everyone went back to SentencePiece in the first place? Byte-pair encoding (which is what tiktoken uses: https://github.com/openai/tiktoken) was shown to be a more efficient encoding as far back as GPT-2 in 2019.
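For reference, tiktoken's BPE interface is tiny; a quick sketch using the cl100k_base vocabulary (Tekken's own tiktoken configuration isn't bundled with the library, so this vocabulary is a stand-in):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an OpenAI BPE vocabulary
text = "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)"

tokens = enc.encode(text)
print(len(tokens), "tokens:", tokens[:8], "...")
assert enc.decode(tokens) == text  # byte-level BPE round-trips losslessly
```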
> Mistral NeMo comes packaged as an NVIDIA NIM inference microservice, offering performance-optimized inference with NVIDIA TensorRT-LLM engines.
> *Designed to fit on the memory of a single NVIDIA L40S, NVIDIA GeForce RTX 4090 or NVIDIA RTX 4500 GPU*, the Mistral NeMo NIM offers high efficiency, low compute cost, and enhanced security and privacy.
> The model was trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs on DGX Cloud, composed of NVIDIA AI architecture, including accelerated computing, network fabric and software to increase training efficiency.
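For what it's worth, NIM containers expose an OpenAI-compatible HTTP API, so calling a locally deployed instance looks roughly like the sketch below. The port and model name here are assumptions (check the container's docs and its /v1/models endpoint for the real values):

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed default NIM port
    json={
        "model": "mistral-nemo-12b-instruct",  # hypothetical name; verify via /v1/models
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```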
But pumping out models and putting the artifacts on HuggingFace, is that a business? What are these models being used for? New ones are appearing at a decent clip.
I really don't get Nvidia's thinking with this. They basically have a hardware monopoly. I shelled out the $4,000 or so to buy two of their 4090 GPUs. Why are they still insisting on torturing me with jumping through these awful hoops? They should just be glad that they're winning and embrace freedom.
EDIT: This is a 1W light bulb moment for me, thank you!
If I were them I'd want to be the default source of the versions of my models that people use, rather than farming that out to whichever third party races to publish the GGUF (and other formats) first.
Perhaps it's just confirmation bias, but programming really does seem to be the ideal use case for LLMs in a way that other professions just haven't been able to crack. Compared to other types of work, it's relatively straightforward to tell whether code is "correct" or not.
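That checkability is also cheap to operationalize: run the generated code against a few spot checks and you get an automatic, if incomplete, signal. A toy sketch:

```python
candidate = '''
def add(a, b):
    return a + b
'''  # stand-in for LLM-generated source

namespace: dict = {}
exec(candidate, namespace)  # execute in a scratch namespace (trusted input only!)
add = namespace["add"]

# Passing these doesn't prove correctness, but failing any one proves a bug --
# a feedback loop most other professions' outputs don't offer.
assert add(2, 3) == 5
assert add(-1, 1) == 0
print("spot checks passed")
```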
The same thing happened with gemma-27b, where they compared it to all the 7-9b models.
It seems like an easy way to boost benchmarks while coming off as "small" at first glance.
From Mistral's page about Tekken:
> Our newest tokenizer, tekken, uses the Byte-Pair Encoding (BPE) with Tiktoken.
Does that mean that Mistral found that BPE is more efficient than unigram models?
Because otherwise, I don't understand why AI companies keep using BPE for their token sets. Unigram methods lead to more legible tokens, fewer glitch tokens, fewer super-long outlier tokens, etc.
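If anyone wants to compare the two directly, SentencePiece can train both model types on the same corpus; a minimal sketch, with corpus.txt standing in for whatever text you have on hand:

```python
import sentencepiece as spm  # pip install sentencepiece

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",              # placeholder: any plain-text file
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )
    sp = spm.SentencePieceProcessor(model_file=f"tok_{model_type}.model")
    pieces = sp.encode("Mistral NeMo uses the Tekken tokenizer.", out_type=str)
    print(model_type, len(pieces), pieces)
```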
1) Anyone have any idea of VRAM requirements?
2) When will this be available on ollama?
Also, are these small models OSS? Easier self-hosting seems to be the main benefit of small models.