October 11th, 2024

Understanding the Limitations of Mathematical Reasoning in Large Language Models

The paper "GSM-Symbolic" examines the limitations of Large Language Models in mathematical reasoning, introducing a new benchmark and revealing performance variability and struggles with logical reasoning in LLMs.

Sentiment: Disappointment, Skepticism, Curiosity

The paper titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" explores the mathematical reasoning capabilities of Large Language Models (LLMs) using the GSM8K benchmark. Despite improvements in LLM performance on this benchmark, the authors question the reliability of these metrics, suggesting that LLMs may not have genuinely advanced in mathematical reasoning. To address this, they introduce GSM-Symbolic, a new benchmark that generates diverse questions from symbolic templates, allowing for more controlled evaluations. The study reveals significant performance variability in LLMs when faced with different instantiations of the same question, particularly when numerical values are altered. The research indicates that LLMs struggle with logical reasoning, as their performance declines sharply (up to 65%) with the addition of irrelevant clauses in questions. This suggests that LLMs replicate reasoning steps from training data rather than performing true logical reasoning. Overall, the findings provide a deeper understanding of the limitations of LLMs in mathematical reasoning tasks.

- The GSM8K benchmark is commonly used to evaluate LLMs' mathematical reasoning.

- The new GSM-Symbolic benchmark offers improved evaluation metrics for LLMs.

- LLMs show significant performance drops when numerical values in questions are changed.

- The addition of irrelevant clauses can drastically reduce LLM performance.

- The study highlights the difference between replicating reasoning steps and genuine logical reasoning in LLMs.

AI: What people are saying
The comments on the "GSM-Symbolic" paper reveal several key insights regarding the limitations of Large Language Models (LLMs) in mathematical reasoning.
  • Many commenters draw parallels between LLM performance and human reasoning, suggesting that both exhibit similar limitations, especially when faced with complex or irrelevant information.
  • There is a consensus that LLMs struggle with logical reasoning and mathematical tasks, particularly when the questions are altered or contain extraneous details.
  • Some argue that the current benchmarks for LLMs may not accurately reflect their reasoning capabilities, indicating potential overfitting and a need for better evaluation methods.
  • Several comments emphasize the importance of developing systems that integrate LLMs with other reasoning tools to enhance their performance.
  • There is a call for more research and investment in improving mathematical reasoning capabilities in LLMs rather than focusing solely on advanced AI concepts like AGI.
32 comments
By @parsimo2010 - 4 months
I won't take a strong stance on whether or not LLMs actually do reasoning, but I will say that this decrease in performance is similar to what I see in college freshmen (I'm currently teaching a calculus course in which almost half of the students took AP calc in high school). They perform well on simple questions. Requiring students to chain multiple steps together, even simple steps, results in decreased accuracy and higher variance (I have no data on whether this decrease is linear or not; the paper assumes the decrease should be linear in the number of steps). We see similar results when adding unrelated statements to a problem: many students are trained to make sure they use all the given information when solving a problem, because if you leave out something the instructor gives you, you probably forgot to do something important.

So while I don't take a stance on whether what an LLM does should be considered reasoning, I do think that SOTA LLMs like GPT-4o perform about as well as American high school graduates of average intelligence. In other words, average Americans exhibit similar limitations in their reasoning as good LLMs. Which, on the one hand, is a little disappointing to me in terms of the human performance, but is kind of good news for LLMs: they aren't doing graduate-level research, but they are already capable of helping a large portion of the population.

By @woopwoop - 4 months
This paper, among other things, shows that LLMs have dramatically worse performance on basic algebra questions when you add in irrelevant information. The examples are things like "John picked 43 kiwis on Monday, 24 kiwis on Tuesday. On Wednesday, 5 of the kiwis he picked were smaller than usual. Altogether, on Monday, Tuesday, and Wednesday, John picked 87 kiwis. How many kiwis did John pick on Wednesday?" In this question, the remark about some of the kiwis on Wednesday being small is irrelevant, but adding things like this reduces performance on a popular benchmark from 95% to 77% for GPT-4o, for example.
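
(For reference, the intended answer is 87 - 43 - 24 = 20 kiwis on Wednesday; the size remark has no bearing on the arithmetic.)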

I don't find this very impressive. Forget LLMs for a second. Let's say _you_ read a question of that kind with some bit of irrelevant information. There are two possibilities you have to consider: the question may as well have excluded the irrelevant information, or the question was miswritten and the irrelevant information was meant to be relevant. The latter is a perfectly live possibility, and I don't think it's a dramatic failure to assume that this is correct. I have to confess that when I read some people's LLM gotcha questions, where they take some popular logic puzzle and invert things, I think I would get them "wrong" too. And not wrong because I don't understand the question, but wrong because with no context I'd just assume the inversion was a typo.

By @s-macke - 4 months
These results are very similar to the "Alice in Wonderland" problem [1, 2], which was already discussed a few months ago. However, the authors of the other paper are much more critical and call it a "Complete Reasoning Breakdown".

You could argue that the issue lies in the models being in an intermediate state between pattern matching and reasoning.

To me, such results indicate that you can't trust any LLM benchmark results related to math and reasoning when you see that changing the characters, numbers, or sentence structure in a problem alters the outcome by more than 20 percentage points.

[1] https://arxiv.org/html/2406.02061v1

[2] https://news.ycombinator.com/item?id=40811329

By @bob1029 - 4 months
> we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning

I'd offer a simpler explanation: Tokenization.

If you tokenize "12345 * 27271" you will get the following:

  "123", "45", " *", " ", "272", "71"
The statistical likelihood that any of these tokens predicts any of the others is completely meaningless in the context of simple arithmetic.
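
A quick way to inspect this yourself, assuming the tiktoken library and its cl100k_base encoding (the exact split varies from tokenizer to tokenizer, so treat the output as illustrative):

  import tiktoken

  # Show how a BPE tokenizer splits an arithmetic expression into sub-tokens.
  enc = tiktoken.get_encoding("cl100k_base")
  tokens = enc.encode("12345 * 27271")
  print([enc.decode([t]) for t in tokens])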

You can argue that this is where tool use comes in (and I would be inclined to agree), but I don't think this bodes well for "genuine logical reasoning".

By @dev1ycan - 4 months
I don't understand the idiocracy we live in. It is beyond obvious not just that the stock market is a bubble, but that AI-related stocks ESPECIALLY are a massive bubble. When it pops, and it will, it is going to be very, very ugly, yet people keep pouring in. As Sabine said, it's starting to look like particle physics, where they keep asking for bigger colliders: just because you have a bigger collider, if your methodology is flawed you aren't gonna get any more significant returns.

Eventually they will run out of exponential amounts of cash to pour in, and investors will start asking questions. Stocks are already valued at 60x+ their earnings; whenever it pops, you don't want to be the one who bought the top.

Guess it's still gonna take a while more for the layman to realize the issues with LLMs, but it'll happen.

By @trehalose - 4 months
I see a lot of discussion about irrelevant clauses tripping up the LLMs and why that does or doesn't matter. To me, what's far more damning is this:

> Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark.

This seems like irrefutable evidence of overfitting that, in the best-case scenario, is epidemic among current LLMs (and, under the worst-case interpretation, is covering up a fundamental inability to learn mathematical reasoning from the training data).

By @thenoblesunfish - 4 months
Very interesting, and aligns with what I would expect in terms of the type of "thinking" LLMs do. I think that it's also the type of "thinking" that will let a student pass most school courses, except of course for the ones where the teacher has taken the time to pose test questions that aren't as amenable to pattern matching. (Hard, but I assume most readers here are familiar with leetcode style interviews and what makes questions of that kind higher or lower quality for assessing candidates)

(And yes, I know people are hard at work adding other types of thinking to work along with the pure language models)

By @yk - 4 months
I actually test LLMs in a similar way. For example, there is a well-known logic puzzle where a farmer tries to cross a river with a cabbage, a goat, and a wolf. LLMs have been able to solve that since at least GPT-2; however, if we replace the wolf with a cow, gpt-o does correctly infer the rules of the puzzle but can't solve it.
By @criddell - 4 months
It would be interesting if this kind of work could ever be extended to show the limitations of mathematical reasoning in animals and humans.

For example, just as a dog will never understand a Fourier transform, there are likely ideas that humans cannot understand. If we know what our limits are, I wonder if we could build machines that can reason in ways we aren't capable of?

By @codelion - 4 months
This is surprising only to those who have not worked in formal reasoning. Yes, LLMs cannot do true logical reasoning in a formal sense; you can do better with an SMT solver. But it is also true that you can solve a lot of logical problems by just applying “reasoning steps” from the training data, especially when your training data is the entirety of written content ever produced. Both of these can be true at the same time; it is not a contradiction, just an interesting dichotomy.
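
As a concrete contrast, the kiwi problem quoted elsewhere in this thread reduces to a single linear constraint that an SMT solver dispatches exactly; a minimal sketch, assuming the z3-solver Python package:

  from z3 import Int, Solver, sat

  # "John picked 43, 24, and w kiwis; the total is 87."
  w = Int("wednesday_kiwis")
  s = Solver()
  s.add(43 + 24 + w == 87)
  if s.check() == sat:
      print(s.model()[w])  # 20
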
By @dang - 4 months
Related ongoing thread:

LLMs don't do formal reasoning - https://news.ycombinator.com/item?id=41812523 - Oct 2024 (70 comments)

By @singularity2001 - 4 months
If the argument is that LLMs are bad at reasoning because they are easily distractible and the results vary with modifications in the question, one should be reminded of the consistency and distractibility of humans.
By @K0balt - 4 months
Trying to solve (much less explore) mathematics using probabilistic next-token prediction seems like the really long way around, especially when we have pretty good deterministic tools available for our use. I don’t know why anyone would bother doing anything besides working on the correct manipulation of tools.

Brains have various structures that have distinct architectures. I don’t see any indication that the best way forward is to try to shoehorn everything into a single computational paradigm.

It’s like trying to make a flying submarine car. It might technically be possible, but it might not be worth the trouble, and it’s unlikely to result in a vehicle that works excellently in any of its environments.

By @gradientsrneat - 4 months
Could this be Goodhart's Law in action? AI tools like to showcase benchmarks in bar graphs to show how well they perform compared to other models.

Maybe the benchmark Qs/As snuck into training sets accidentally. Is it still Goodhart's Law if it's unintentional?

Daniel Lemire has blogged about being impressed with how well the LLM answers his CS problem questions. I was impressed too. Not sure where the line of competence lies.

By @eigenform - 4 months
The difference is that, if we are solving a math problem together, you and I [explicitly or implicitly] can come to an agreement over the context and decide to restrict our use of language with certain rules. The utility behind our conversation [generally] rests on those rules!

An LLM is very good at recovering rules, but being good at pattern recognition is not the same thing as being good at unambiguously following rules in the appropriate context.

edit: Natural language is far from an efficient/sufficient/necessary intermediate representation for doing math, just ask any general-purpose computer. Sometimes, it's worth "putting rules in stone," and it seems unreasonable to believe that there is always an unambiguous rule for this that you can mechanically recover from a corpus of language use.

By @i007 - 4 months
LLMs are designed to carry out "associative reasoning" which captures logic based on recognition and recall of compositional patterns learned during training.

Having said that, we can still get semantically and logically idempotent output that makes sense, but it takes lots of work outside of the LLM, which contrasts with the current hyper-focus on the LLM itself as the be-all and end-all. It is just one component in what ought to be a larger and more involved system for reasoning.

Look at what we were able to accomplish here for Legal AI, not so mathematical logic per se but mimicking (capturing) axiomatic logic in the legal domain:

https://www.youtube.com/watch?v=_9Galw9-Z3Q

marc at sunami dot ai

By @jgord - 4 months
I propose 'gords rule': "any sufficiently advanced LLM will learn the laws of logic, the principles of the scientific method, and Reinforcement Learning".

Until that happens, I think RL startups focused on real problems are much undervalued: https://quantblog.wordpress.com/2024/10/11/llm-hype-means-th...

By @gtsop - 4 months
LLMs are inherently emulators of digitally imprinted artifacts of human consciousness. When people truly grasp what this means, they will stop being baffled by the fact that LLM performance deteriorates as the novelty of the task increases.

EDIT: Had there been an ounce of actual true reasoning emerging in LLMs, OpenAI would have been running this thing privately 24/7 to produce new science and capture patents that would give them economic dominance, not trying to sell tokens to us all.

By @ak_111 - 4 months
As an outsider, can anyone enlighten me as to how this squares with the news that models adopting a similar LLM architecture can obtain a silver medal in the mathematical olympiad?
By @uptownfunk - 4 months
The fundamental problem with LLMs is that there is no guarantee on any of the reasoning they give you without a human there to give a thumbs up. People are working on solving this (AlphaProof, LeanAgent, etc.), but getting it to run at inference time in an optimized way is what I would call one of the millennium prize problems of AI, one which would lead to a quantum leap on the path towards the singularity.
By @woopwoop - 4 months
I'm curious about what happens with the no-op dataset if you include in the prompt that the questions may contain irrelevant information.
By @teleforce - 4 months
In terms of usefulness and realistic implementation, mathematical reasoning is the next frontier for LLMs, not autonomous level-5 driving or AGI. More research funding and investment would be much better spent on the former rather than the latter, but apparently the reverse is the case.
By @resters - 4 months
I think it's obvious that LLMs will be able to do "reasoning" far better than humans. We must separate out our notion of what is remarkably human: rarely is it the reasoning; it's the intuition that a logical path exists -- for example, a mathematical proof that draws from separate sub-disciplines of mathematics.

Consider that in an LLM, language inputs are tokenized and fed as inputs into the neural network, and connections in the network create output sequences that are not just syntactically correct (trivial) or semantically plausible sentences (early transformers did this). LLM output sequences follow the deep patterns of language, which include something that resembles reasoning, as learnt from the training data.

LLMs seem to fall short because they often fail at truly abstract reasoning tasks that humans find easy. If trained properly, LLMs can develop advanced representations of logical systems that will surely outpace what humans can do in terms of raw reasoning.

However, human mathematicians have not even unified around constructive mathematics as a must for the study of mathematics. This reveals that even highly evolved mathematical disciplines rely on objects whose characteristics do not lend themselves to full logical scrutiny and are in a way socially constructed and effectively hard to audit.

While notation in mathematics is incredible technology it is also a highly limiting factor that suffers major tradeoffs. Humans struggle to invent new notation fast enough and to discard outdated notation fast enough. If we do see an AI-powered boom in mathematics, I suspect our notion of notation and the fluidity we demand from it will change dramatically.

By @dr_dshiv - 4 months
It seems incredibly easy to generate an enormous amount of synthetic data for math. Is that happening? Does it work?
By @Animats - 4 months
It's an expected result.

Whatever happened with that result which found some representation of the state of a game inside an LLM? That indicated some degree of model-building. Haven't heard about that again.

By @qwerty456127 - 4 months
Can't an LLM just detect a mathematical reasoning task, then produce a formula (not even displaying it in production mode) to invoke on an external service engineered for formal logical and mathematical computations?
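
A rough sketch of that routing idea, where ask_llm is a hypothetical stand-in for whatever model API is used and SymPy plays the external-service role:

  import re
  from sympy import sympify

  def answer(question: str, ask_llm) -> str:
      # Ask the model only for a formula, never for the final number.
      formula = ask_llm("Translate to one arithmetic expression, no prose: " + question)
      # Hand the actual computation to an exact symbolic engine.
      if re.fullmatch(r"[0-9+\-*/(). ]+", formula):
          return str(sympify(formula))
      return formula  # fall back to the raw model output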
By @bubble12345 - 4 months
Can LLMs even do addition with, say, 20+ digit numbers? Multiplication?
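
A minimal way to probe this is to generate additions with known exact answers and compare; query_model below is a hypothetical stand-in for an actual LLM call:

  import random

  # One 20-digit addition probe with a known exact answer.
  a = random.randrange(10**19, 10**20)
  b = random.randrange(10**19, 10**20)
  prompt = f"What is {a} + {b}? Reply with digits only."
  expected = str(a + b)
  # reply = query_model(prompt)   # hypothetical LLM call
  # print(reply.strip() == expected)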
By @jumploops - 4 months
> Overall, while o1-preview and o1-mini exhibit significantly stronger results compared to current open models—potentially due to improved training data and post-training procedures—they still share similar limitations with the open models.

tl;dr - the best open model dropped from 89.7% on GSM8K (full) to 30% on Symbolic-NoOp, while o1-preview dropped from 94.9% to 77.4%.

I think all this paper shows is that LLMs need space to "think" outside of their inference layer (for the current architectures, at least).

It's similar to the "draw a room, but DO NOT put an elephant in the corner" prompts that people were using with image models.

This is something that practitioners have been doing for a while (via CoT, ToT, etc.), and it's the whole rationale behind OpenAI's newly launched o1-series "model."

There's another post that says this paper proves LLMs can't be used to build "reliable agents" -- which doesn't appear to be true when you look at o1's stellar performance here.

By @beardyw - 4 months
I honestly can't see why LLMs should be good at this sort of thing. I am convinced you need a completely different approach. At the very least, you usually want exactly one completely correct result. Good luck getting current models to do that.
By @throwaway918299 - 4 months
limitations of mathematical reasoning?

They have none. Literally zero. That’s the limit. Thank you for reading my paper.
