July 24th, 2024

Large Enough – Mistral AI

Mistral AI released Mistral Large 2, enhancing code generation, reasoning, and multilingual support with 123 billion parameters. It outperforms competitors and is available for research use via various cloud platforms.

Mistral AI has announced the release of Mistral Large 2, an advanced version of its flagship model, which significantly enhances capabilities in code generation, mathematics, reasoning, and multilingual support. This model features a 128k context window and supports numerous languages, including major European and Asian languages, as well as over 80 programming languages. Mistral Large 2 is designed for efficient single-node inference, boasting 123 billion parameters for high throughput.

The model demonstrates improved performance metrics, achieving an accuracy of 84.0% on the MMLU benchmark and outperforming its predecessor and competing models like GPT-4o and Claude 3 Opus. Key enhancements include reduced "hallucination" tendencies, better reasoning skills, and improved instruction-following capabilities, making it adept at handling complex queries and multi-turn conversations.

Mistral Large 2 is available under the Mistral Research License for research and non-commercial use. It is accessible via la Plateforme and can be tested through the API named mistral-large-2407. The model is part of a broader consolidation of Mistral's offerings, which includes general-purpose and specialist models. Additionally, Mistral AI has expanded partnerships with major cloud service providers, making its models available on platforms like Google Cloud, Azure, Amazon Bedrock, and IBM Watson. Fine-tuning capabilities for Mistral Large, Mistral Nemo, and Codestral have also been introduced on la Plateforme.
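As a sketch of what calling the versioned model might look like, the snippet below assembles a single-turn chat request for `mistral-large-2407`. The payload shape follows the common chat-completions convention; the exact endpoint path and fields are assumptions that should be checked against Mistral's official API documentation.

```python
# Hypothetical sketch: building a chat-completions request for Mistral
# Large 2 on la Plateforme. Endpoint path and payload fields follow the
# common chat-completions convention and are assumptions, not Mistral's
# documented API.
API_URL = "https://api.mistral.ai/v1/chat/completions"

def build_request(prompt: str, api_key: str) -> dict:
    """Assemble headers and JSON body for a single-turn chat request."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = {
        "model": "mistral-large-2407",  # versioned name from the announcement
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return {"url": API_URL, "headers": headers, "json": body}

req = build_request("Summarize the Mistral Large 2 release.", api_key="YOUR_KEY")
```

The returned dict can be handed to any HTTP client (e.g. `requests.post(req["url"], headers=req["headers"], json=req["json"])`).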

Related

Codestral Mamba

Codestral Mamba, a new Mamba2 language model by Mistral AI, excels in code generation with linear time inference and infinite sequence modeling. It rivals transformer models, supports 256k tokens, and aids local code assistance. Deployable via mistral-inference SDK or TensorRT-LLM, it's open-source under Apache 2.0.

Mistral NeMo

Mistral AI introduces Mistral NeMo, a powerful 12B model developed with NVIDIA. It features a large context window, strong reasoning abilities, and FP8 inference support. Available under Apache 2.0 license for diverse applications.

Mathstral: 7B LLM designed for math reasoning and scientific discovery

MathΣtral, a new 7B model by Mistral AI, focuses on math reasoning and scientific discovery, inspired by Archimedes and Newton. It excels in STEM with high reasoning abilities, scoring 56.6% on MATH and 63.47% on MMLU. The model's release under Apache 2.0 license supports academic projects, showcasing performance/speed tradeoffs in specialized models. Further enhancements can be achieved through increased inference-time computation. Professor Paul Bourdon's curation of GRE Math Subject Test problems contributed to the model's evaluation. Instructions for model use and fine-tuning are available in the documentation hosted on HuggingFace.

Can the New Mathstral LLM Accurately Compare 9.11 and 9.9?

Mathstral is a new 7B model by Mistral AI for math reasoning, with a 32k context window and Apache 2.0 license. It aims to improve common sense in math problem-solving, deployable locally with LlamaEdge and shareable via GaiaNet for customization and integration.

Run Mistral 7B model using less than 4GB of memory on your Mac with CoreML

Apple introduced Apple Intelligence at WWDC 24, highlighting Core ML's efficiency on Apple Silicon hardware. New features enable running large language models like Mistral 7B on Mac devices with reduced memory usage.

45 comments
By @tikkun - 7 months
Links to chat with models that released this week:

Large 2 - https://chat.mistral.ai/chat

Llama 3.1 405b - https://www.llama2.ai/

I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts from my Claude history.

I'd rank as:

1. Sonnet 3.5

2. Large 2 and Llama 405b (similar, no clear winner between the two)

If you're using Claude, stick with it.

My Claude wishlist:

1. Smarter (yes, it's the most intelligent, and yes, I wish it was far smarter still)

2. Longer context window (1M+)

3. Native audio input including tone understanding

4. Fewer refusals and less moralizing when refusing

5. Faster

6. More tokens in output

By @TIPSIO - 7 months
This race for the top model is getting wild. Everyone is claiming to one-up each other with every version.

In my experience (benchmarks aside), Claude 3.5 Sonnet absolutely blows everything away.

I'm not really sure how to even test/use Mistral or Llama for everyday use though.

By @huevosabio - 7 months
The non-commercial license is underwhelming.

It seems to be competitive with Llama 3.1 405b but with a much more restrictive license.

Given how the difference between these models is shrinking, I think you're better off using llama 405B to finetune the 70B on the specific use case.

This would be different if it was a major leap in quality, but it doesn't seem to be.

Very glad that there's a lot of competition at the top, though!

By @wesleyyue - 7 months
I'm building an AI coding assistant (https://double.bot), so I've tried pretty much all of the frontier models. I added this one this morning to play around with, and it's probably the worst model I've ever used. Less coherent than 8B models. Worst case of benchmark hacking I've ever seen.

example: https://x.com/WesleyYue/status/1816153964934750691

By @Liquix - 7 months
These companies full of brilliant engineers are throwing millions of dollars in training costs to produce SOTA models that are... "on par with GPT-4o and Claude Opus"? And then the next 2.23% bump will cost another XX million? It seems increasingly apparent that we are reaching the limits of throwing more data at more GPUs, and that an ARC Prize-level breakthrough is needed to move the needle any further at this point.
By @nen-nomad - 7 months
The models are converging slowly. In the end, it will come down to the user experience and the "personality." I have been enjoying the new Claude Sonnet. It feels sharper than the others, even though it is not the highest-scoring one.

One thing that "exponentialists" forget is that each step also requires exponentially more energy and resources.

By @bugglebeetle - 7 months
I love how much AI is bringing competition (and thus innovation) back to tech. Feels like things were stagnant for 5-6 years prior because of the FAANG stranglehold on the industry. Love also that some of this disruption is coming at out of France (HuggingFace and Mistral), which Americans love to typecast as incapable of this.
By @SebaSeba - 7 months
Sorry for the slightly off-topic question, but can someone enlighten me as to which Claude model is more capable, Opus or Sonnet 3.5? I'm confused because I see people fussing over Sonnet 3.5 being the best, yet I keep reading in factual texts and some benchmarks that Claude Opus is the most capable. Is there a simple answer? What am I not understanding?
By @OldGreenYodaGPT - 7 months
I still prefer ChatGPT-4o and use Claude when I have issues, but it never does any better.
By @rkwasny - 7 months
All the evals we have are just far too easy! A <1% difference is just noise/bad data.

We need to figure out how to measure intelligence that is greater than human.

By @calibas - 7 months
"Mistral Large 2 is equipped with enhanced function calling and retrieval skills and has undergone training to proficiently execute both parallel and sequential function calls, enabling it to serve as the power engine of complex business applications."

Why does the chart below say the "Function Calling" accuracy is about 50%? Does that mean it fails half the time with complex operations?
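For readers unfamiliar with what "parallel and sequential function calls" means in practice, here is a hedged sketch. The tool schemas below follow the common JSON-Schema "tools" convention used by chat-completions APIs; the function names and fields are illustrative, not Mistral's documented interface.

```python
import json

# Illustrative tool definitions in the common chat-completions "tools" shape.
# All names here (get_invoice, get_customer) are hypothetical.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_invoice",
            "description": "Fetch an invoice by id",
            "parameters": {
                "type": "object",
                "properties": {"invoice_id": {"type": "string"}},
                "required": ["invoice_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_customer",
            "description": "Fetch a customer record by id",
            "parameters": {
                "type": "object",
                "properties": {"customer_id": {"type": "string"}},
                "required": ["customer_id"],
            },
        },
    },
]

# A "parallel" call is the model emitting several tool calls in one turn;
# a "sequential" call feeds one call's result back before the next request.
def parse_tool_calls(response_message: dict) -> list:
    """Extract (name, arguments) pairs from a model turn's tool calls."""
    calls = []
    for call in response_message.get("tool_calls", []):
        fn = call["function"]
        calls.append((fn["name"], json.loads(fn["arguments"])))
    return calls
```

A function-calling benchmark then scores whether the model picked the right tool with well-formed arguments, which is why the accuracy can sit far below 100%.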

By @freediver - 7 months
Sharing PyLLMs [1] reasoning benchmark results for some of the recent models. Surprised by nemo (speed/quality) and mistral large is actually pretty good (but painfully slow).

AnthropicProvider('claude-3-haiku-20240307') Median Latency: 1.61 | Aggregated speed: 122.50 | Accuracy: 44.44%

MistralProvider('open-mistral-nemo') Median Latency: 1.37 | Aggregated speed: 100.37 | Accuracy: 51.85%

OpenAIProvider('gpt-4o-mini') Median Latency: 2.13 | Aggregated speed: 67.59 | Accuracy: 59.26%

MistralProvider('mistral-large-latest') Median Latency: 10.18 | Aggregated speed: 18.64 | Accuracy: 62.96%

AnthropicProvider('claude-3-5-sonnet-20240620') Median Latency: 3.61 | Aggregated speed: 59.70 | Accuracy: 62.96%

OpenAIProvider('gpt-4o') Median Latency: 3.25 | Aggregated speed: 53.75 | Accuracy: 74.07%

[1] https://github.com/kagisearch/pyllms

By @epups - 7 months
The graphs seem to indicate their model trades blows with Llama 3.1 405B, which has more than 3x the number of parameters and (presumably) a much bigger compute budget. It's kind of baffling if this is confirmed.

Apparently Llama 3.1 relied on synthetic data; I'd be very curious about the type of data Mistral uses.

By @thntk - 7 months
Anyone know what caused the very big performance jump from Large 1 to Large 2 in just a few months?

Besides, parameter redundancy seems evident. Frontier models used to be 1.8T, then 405B, and now 123B. If frontier models in the future are <10B or even <1B, that would be a game changer.

By @moralestapia - 7 months
Nice, they finally got the memo that GPT4 exists and include it in their benchmarks.
By @rkwz - 7 months
> A significant effort was also devoted to enhancing the model’s reasoning capabilities. One of the key focus areas during training was to minimize the model’s tendency to “hallucinate” or generate plausible-sounding but factually incorrect or irrelevant information. This was achieved by fine-tuning the model to be more cautious and discerning in its responses, ensuring that it provides reliable and accurate outputs.

Is there a benchmark or something similar that compares this "quality" across different models?

By @Always42 - 7 months
I'm really glad these guys exist
By @novok - 7 months
I kind of wonder why a lot of these places don't release "amateur"-sized models anymore, at around the 18B and 30B parameter sizes that you can run on a single 3090 or M2 Max at reasonable speeds and RAM requirements. It's all 7B, 70B, 400B sizing nowadays.
By @doctoboggan - 7 months
The question I (and I suspect most other HN readers) have is which model is best for coding? While I appreciate the advances in open weights models and all the competition from other companies, when it comes to my professional use I just want the best. Is that still GPT-4?
By @avereveard - 7 months
important to note that this time around weights are available https://huggingface.co/mistralai/Mistral-Large-Instruct-2407
By @ThinkBeat - 7 months
A side note about the ever-increasing costs of advancing these models: I feel certain that some branch connected to the NSA is running and advancing models that probably exceed what the open market provides today.

Maybe they are running them on proprietary or semi-proprietary hardware, but if they don't, how much does the market know about where various shipments of NVIDIA processors end up?

I imagine most intelligence agencies need vast quantities.

I presume that if Microsoft announces new availability of AI compute, it means they have received and put into production X NVIDIA processors, which might make it possible to guesstimate within some bounds how many.

Same with other open-market compute facilities.

Is it likely that a significant share of NVIDIA processors is going to governments / intelligence agencies / fronts?

By @modeless - 7 months
The name just makes me think of the screaming cowboy song. https://youtu.be/rvrZJ5C_Nwg?t=138
By @erichocean - 7 months
I like Claude 3.5 Sonnet, but despite paying for a plan, I run out of tokens after about 10 minutes. Text only, I'm typing everything in myself.

It's almost useless because I literally can't use it.

Update: https://support.anthropic.com/en/articles/8325612-does-claud...

45 messages per 5 hours is the limit for Pro users, less if Claude is wordy in its responses—which it always is. I hit that limit so fast when I'm investigating something. So annoying.

They used to let you select another, worse model but I don't see that option anymore. le sigh

By @demarq - 7 months
Super looking forward to this.

I tried Codestral and nothing came close. Not even slightly. It was the only LLM that consistently put out code for me that was runnable and idiomatic.

By @rldjbpin - 7 months
the way these models are being pushed, it seems like more one-upping each other through iterative improvements than actual breakthroughs.

these benchmarks are as good as the hardware ones apple or intel push to sell their stuff. in the real world, most people will end up with some modifications for their specific use case anyway. for those, i argue, we already have "capable enough" models for the job.

By @teaearlgraycold - 7 months
By @Tepix - 7 months
Just in case you haven't RTFA: Mistral Large 2 is 123B.
By @ashenke - 7 months
I tested it with my Claude prompt history; the results are as good as Claude 3.5 Sonnet, but it's 2 or 3 times slower.
By @tonetegeatinst - 7 months
What do they mean by "single-node inference"?

Do they mean inference done on a single machine?
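
(Yes: a single machine, typically one multi-GPU server, as opposed to sharding the model across several networked nodes. A back-of-envelope memory estimate, assuming standard precision sizes, shows why 123B parameters is plausibly single-node:)

```python
# Rough weight-memory estimate for a 123B-parameter model. Weights only;
# activations and KV cache add more on top. Assumes 1 GB = 1e9 bytes.
PARAMS = 123e9

def weight_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB for a given numeric precision."""
    return params * bytes_per_param / 1e9

fp16 = weight_gb(PARAMS, 2)    # ~246 GB: fits an 8x80GB GPU node (640 GB)
int8 = weight_gb(PARAMS, 1)    # ~123 GB
int4 = weight_gb(PARAMS, 0.5)  # ~61.5 GB
```

So at FP16 the weights alone fit comfortably in a single 8-GPU server, which is presumably what the announcement means by "efficient single-node inference."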

By @greenchair - 7 months
can anyone explain why the % success rates differ so much between these programming languages? is this a function of the amount of training data available for each language, or the complexity of the language, or what?
By @zone411 - 7 months
Improves from 17.7 for Mistral Large to 20.0 on the NYT Connections benchmark.
By @ilaksh - 7 months
How does their API pricing compare to 4o and 3.5 Sonnet?
By @philip-b - 7 months
Does any one of the top models have access to the internet and googling things? I want an LLM to look things up and do casual research for me when I’m lazy.
By @htk - 7 months
is it possible to run Large 2 on ollama?
By @mvdtnz - 7 months
Imagine bragging about 74% accuracy in any other field of software. You'd be laughed out of the room. But somehow it's accepted in "AI".
By @gavinray - 7 months
"It's not the size that matters, but how you use it."
By @h1fra - 7 months
There are now more AI models than JavaScript frameworks!
By @whisper_yb - 7 months
Every day a new model better than the previous one lol
By @RyanAdamas - 7 months
Personally, I think language diversity should be the last thing on the list. If we had optimized every piece of software from the get-go for a dozen languages, our forward progress would have been dead in the water.
By @breck - 7 months
When I see this "© 2024 [Company Name], All rights reserved", it's a tell that the company does not understand how hopelessly behind they are about to be.