July 19th, 2024

Can the New Mathstral LLM Accurately Compare 9.11 and 9.9?

Mathstral is a new 7B model by Mistral AI for math reasoning, with a 32k context window and an Apache 2.0 license. It aims to improve common sense in math problem-solving, can be deployed locally with LlamaEdge, and can be shared via GaiaNet for customization and integration.

Mathstral is a new 7B model by Mistral AI tailored for math reasoning and scientific exploration, featuring a 32k context window and available under the Apache 2.0 license. While advanced LLMs such as GPT-4o can tackle complex math problems, they may lack common sense, as highlighted by their struggles with basic comparisons like whether 9.11 or 9.9 is larger. Mathstral aims to bridge this gap, showcasing its ability to reason through a common math question accurately. The model can be run locally using LlamaEdge, a Rust + Wasm stack that simplifies deployment without complex toolchains. Additionally, the GaiaNet project enables sharing Mathstral with friends and customizing its usage, offering an OpenAI-compatible API endpoint and a web-based chatbot UI. This trend emphasizes how fine-tuned open-source models can excel in specialized domains compared to larger closed-source counterparts. GaiaNet goes beyond model deployment, allowing prompt manipulation, context addition, and integration of proprietary knowledge bases for more grounded responses.
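
For readers who want to reproduce the comparison, a minimal sketch of querying such an OpenAI-compatible endpoint is shown below; the URL, port, and model name are placeholders that depend on how the LlamaEdge or GaiaNet server was started, so check their documentation for the exact values.

    import requests

    # Placeholder endpoint and model name; adjust to match your LlamaEdge/GaiaNet setup.
    URL = "http://localhost:8080/v1/chat/completions"
    payload = {
        "model": "mathstral-7b",  # assumed model name
        "messages": [
            {"role": "user", "content": "Which number is bigger, 9.11 or 9.9? Explain briefly."}
        ],
        "temperature": 0,
    }
    reply = requests.post(URL, json=payload).json()
    print(reply["choices"][0]["message"]["content"])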

Related

Gemma 2 on AWS Lambda with Llamafile

Google released Gemma 2 9B, a compact language model rivaling GPT-3.5. Mozilla's llamafile simplifies deploying models like LLaVA 1.5 and Mistral 7B Instruct, enhancing accessibility to powerful AI models across various systems.

Reasoning skills of large language models are often overestimated

Large language models like GPT-4 rely heavily on memorization over reasoning, excelling in common tasks but struggling in novel scenarios. MIT CSAIL research emphasizes the need to enhance adaptability and decision-making processes.

Codestral Mamba

Codestral Mamba, a new Mamba2 language model by Mistral AI, excels in code generation with linear time inference and infinite sequence modeling. It rivals transformer models, supports 256k tokens, and aids local code assistance. Deployable via mistral-inference SDK or TensorRT-LLM, it's open-source under Apache 2.0.

Mistral NeMo

Mistral AI introduces Mistral NeMo, a powerful 12B model developed with NVIDIA. It features a large context window, strong reasoning abilities, and FP8 inference support. Available under Apache 2.0 license for diverse applications.

Mathstral: 7B LLM designed for math reasoning and scientific discovery

MathΣtral, a new 7B model by Mistral AI, focuses on math reasoning and scientific discovery, inspired by Archimedes and Newton. It excels in STEM with high reasoning abilities, scoring 56.6% on MATH and 63.47% on MMLU. The model's release under Apache 2.0 license supports academic projects, showcasing performance/speed tradeoffs in specialized models. Further enhancements can be achieved through increased inference-time computation. Professor Paul Bourdon's curation of GRE Math Subject Test problems contributed to the model's evaluation. Instructions for model use and fine-tuning are available in the documentation hosted on HuggingFace.
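
As a rough sketch of the HuggingFace route mentioned above, the following loads the model with the transformers library; the model id, chat template usage, and generation settings are assumptions rather than the official instructions, so consult the model card for the exact details.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "mistralai/Mathstral-7B-v0.1"  # assumed HuggingFace model id; check the model card
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs accelerate

    messages = [{"role": "user", "content": "Which is greater, 9.11 or 9.9? Explain briefly."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=256)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))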

13 comments
By @vessenes - 3 months
Wait wait wait… the json output is incorrect, full stop. It claims the first decimal digit of 9.9 is ‘0’. Mathstral might be great; it might be terrible; either way this particular test should be done first at 0 temp and then like 50 or 100 times at 0.7 temp, but in any event the writer owes it to themselves (and us) to notice that the claimed ‘good’ output is totally incorrect.
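
A rough sketch of the protocol this comment proposes (one greedy run, then a tally over repeated sampled runs), assuming an OpenAI-compatible endpoint; the URL and model name below are placeholders.

    import requests
    from collections import Counter

    URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint
    QUESTION = "Which number is bigger, 9.11 or 9.9? Answer with the number only."

    def ask(temperature):
        payload = {
            "model": "mathstral-7b",  # placeholder model name
            "messages": [{"role": "user", "content": QUESTION}],
            "temperature": temperature,
        }
        reply = requests.post(URL, json=payload).json()
        return reply["choices"][0]["message"]["content"].strip()

    print("greedy (temperature 0):", ask(0.0))   # single deterministic run
    print(Counter(ask(0.7) for _ in range(50)))  # tally of 50 sampled runs
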
By @xanderlewis - 3 months
> As we have seen, leading edge LLMs, such as the GPT-4o, can solve very complex math problems.

No… they can’t. That’s like saying a search engine can solve math problems — which it can, in a sense.

I suspect that the people repeatedly saying this simply lack the knowledge to know what really constitutes a ‘complex math problem’.

And of course any half-decent new model can answer this particular question correctly; the designers aren’t stupid or unaware of what the expectations and common traps are. The model itself will probably be able to talk about why testing on such comparisons would be interesting (because it ‘knows’ that this is a recent meme).

By @lucabetelci - 3 months
In the JSON response (after "And the response is the following.") it says that "(...) Since 1 (from 9.11) is greater than 0 (implicitly, as there's no second digit in 9.9), we can conclude that:\n\n$$9.11 > 9.9$$ (...)"
By @CharlesW - 3 months
This was all over Threads last week, posted by anti-AI people who don't know how LLMs work. These are the same people who post screenshots of LLMs attempting to count the number of 'r's in "strawberry".

> "The 7B mathstral model answers the math common sense question perfectly with the correct reasoning."

Answers perfectly, sure. But the word "reasoning" is an anthropomorphism and promises a level of cognitive ability that LLMs do not possess.

By @g-w1 - 3 months
I'm quite confused. In the article, the response from mathstral is also wrong???
By @TZubiri - 3 months
None of them is wrong; the answer depends on the type of the object, which the notation doesn't specify:

Version 9.11 is greater than 9.9

Decimal 9.9 is greater than 9.11
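
The distinction is easy to make concrete; a minimal Python sketch using only the standard library:

    from decimal import Decimal

    # Read as decimal numbers: 9.9 is larger.
    print(Decimal("9.9") > Decimal("9.11"))   # True

    # Read as version numbers (major, minor): 9.11 is larger.
    print((9, 11) > (9, 9))                   # True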

By @arnaudsm - 3 months
Naive question: will scaling laws be sufficient for reliable reasoning, or are transformer architectures incapable of that?
By @meisel - 3 months
> The case in point is that most LLMs, including GPT-4o, cannot tell whether 9.11 or 9.8 is bigger!

Wrong. GPT-4o gives me the correct answer to this question, 9.8.

By @rahduro - 3 months
It might have something to do with quantization, though. I have used the Q6_K version from https://huggingface.co/bartowski/mathstral-7B-v0.1-GGUF with Llamafile, and it always shows 9.11 as bigger than 9.9.
By @hdhshdhshdjd - 3 months
This is like 200x more complicated setup than just running Ollama.
By @einarfd - 3 months
I found this interesting and tried the question with the top models from Anthropic, OpenAI, Google, and Mistral, which all gave the wrong result. But if you preface the question with "Of these two decimal numbers", the answers changed and the results were correct. I suspect what we are seeing is that the models handle the numbers as version numbers, not decimal numbers. This is disappointing and confusing, but imo it also underlines that giving them context on what you are trying to get them to do is worthwhile.
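
A minimal sketch of that prefix experiment against an OpenAI-compatible endpoint (the URL and model name below are placeholders):

    import requests

    URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint

    def ask(prompt):
        payload = {
            "model": "mathstral-7b",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        }
        reply = requests.post(URL, json=payload).json()
        return reply["choices"][0]["message"]["content"]

    print(ask("Which is bigger, 9.11 or 9.9?"))
    print(ask("Of these two decimal numbers, which is bigger, 9.11 or 9.9?"))
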
By @bee_rider - 3 months
Which is the correct answer?

(Note that the logic in the response from the LLM is blatantly nonsense).

By @3Sophons - 3 months
AI cannot handle basic math, like deciding which is greater: 9.11 or 9.9? A popular meme has sparked debates about LLMs' grasp of elementary math. Introducing Mathstral, Mistral AI's latest open-source model, fine-tuned specifically for mathematical reasoning and scientific discovery. I just ran a series of tests to determine whether Mathstral can truly discern the larger of two decimal numbers in a way that makes sense to us humans. Using LlamaEdge's Rust + Wasm tech stack, I set up Mathstral on my local machine, with no complex installation needed. The results? Absolutely fascinating and promising for the future of AI in education and beyond. Want to see how it performed and possibly set it up yourself? Check out this detailed, easy-to-follow walkthrough.