October 12th, 2024

Apple study proves LLM-based AI models are flawed because they cannot reason

Apple's study reveals significant reasoning shortcomings in large language models from Meta and OpenAI. It introduces the GSM-Symbolic benchmark and shows that accuracy drops sharply when queries are reworded slightly or padded with irrelevant context.

A recent study by Apple's artificial intelligence team has revealed significant shortcomings in large language models (LLMs) from companies like Meta and OpenAI, particularly in their reasoning abilities. The researchers introduced a new benchmark called GSM-Symbolic to evaluate the reasoning capabilities of these models. Their findings indicate that minor changes in the wording of queries can lead to drastically different answers, highlighting the models' unreliability. The study specifically examined the "fragility" of mathematical reasoning, showing that adding a seemingly relevant but actually irrelevant clause can reduce answer accuracy by up to 65%. For instance, a math problem about the total number of kiwis picked by a character named Oliver was miscalculated by LLMs when an irrelevant detail about undersized kiwis was included. The researchers concluded that these models lack formal reasoning skills and operate primarily through sophisticated pattern matching, which is highly sensitive to changes in input. This fragility raises concerns about the viability of building reliable AI agents on current LLM technology.

- Apple's research highlights the reasoning failures of LLMs from major AI companies.

- The GSM-Symbolic benchmark was proposed to assess reasoning capabilities in LLMs.

- Minor changes in query wording can lead to significantly different answers.

- Adding irrelevant context can drastically reduce the accuracy of mathematical reasoning.

- The study concludes that LLMs rely on pattern matching rather than formal reasoning.
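
To make the kiwi example concrete, here is a minimal arithmetic sketch (an illustration, not code from the study) of the correct reading versus the pattern-matched misreading the researchers describe:

```python
# Correct reading: the "five smaller kiwis" clause does not change the count.
friday = 44
saturday = 58
sunday = 2 * friday                  # "double the number he picked on Friday" = 88

correct_total = friday + saturday + sunday
print(correct_total)                 # 190

# Pattern-matched misreading: treat the irrelevant detail as a subtraction.
distracted_total = friday + saturday + (sunday - 5)
print(distracted_total)              # 185 -- the wrong answer the study reports
```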

11 comments
By @dboreham - 4 months
What an amazing time we live in when people are discussing how perhaps the output from a bunch of matrix convolutions possibly can't quite reason about mathematics.
By @fourgreen - 4 months
The benchmark itself is interesting, but we should notice that the authors didn't test it on Claude 3.5. From my experience, for tasks which need intelligence, Claude 3.5 is much better than GPT or Llama (which the authors used). So the conclusion is not that "LLM-based AI models are flawed because they cannot reason" but rather, at most, "o1-mini and Llama3-8B are flawed because they cannot reason". For instance, the authors run this test:

""" We add seemingly relevant statements to the questions that are, in fact, irrelevant to the reasoning and conclusion. However, the majority of models fail to ignore these statements and blindly convert them into operations, leading to mistakes prompt: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

o1-mini response: Oliver picks kiwis over three days as follows:

- Friday: 44 kiwis

- Saturday: 58 kiwis

- Sunday: Double the number he picked on Friday, which is 2 × 44 = 88 kiwis

However, on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday’s kiwis) - 5 (smaller kiwis) = 83 kiwis.

Now, summing up the kiwis from all three days: 44 (Friday) + 58 (Saturday) + 83 (Sunday) = 185 kiwis.

Oliver has a total of 185 kiwis.

"""

I tried asking exactly the same question to Claude 3.5 and the response was:

"""

Let's break this problem down step by step:

1. Friday's kiwis:
   Oliver picked 44 kiwis on Friday.

2. Saturday's kiwis:
   Oliver picked 58 kiwis on Saturday.

3. Sunday's kiwis:
   Oliver picked double the number of kiwis he did on Friday.
   Friday's kiwis = 44
   Double of Friday's kiwis = 44 × 2 = 88 kiwis
   However, 5 of these were smaller than average, but they still count in the total.

4. Total kiwis:
   Friday's kiwis + Saturday's kiwis + Sunday's kiwis
   = 44 + 58 + 88
   = 190 kiwis

Therefore, Oliver has a total of 190 kiwis.

"""

By @rahimnathwani - 4 months
By @unwise-exe - 4 months
Interesting. I would have thought that the training set (basically the whole internet, AIUI) would have included enough "teacher's version" exams with word problems containing intentionally distracting extra information that the models would have learned to ignore that sort of thing.

This sounds like they're inspecting existing models. Maybe a model trained specifically on "word problem" question-answer pairs (as in, the sort of things that show up on tests and always pretend that the sort of complications a domain expert would know about just don't exist) would do better?

By @refulgentis - 4 months
This is an interesting acid test:

- article is titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"

- some people have been flogging it as "LLMs cannot reason"

- it shows a 6-8 point drop in test results that are in the 80s if you replace the #s in the test-set problems with random #s and run multiple times (a rough sketch of that kind of perturbation follows this list)

- If anything, sounds like a huge W to me: very hard to claim they're just memorizing with that small of a drop
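
For anyone who hasn't read the paper, the perturbation in that third bullet is essentially templating: keep the structure of a GSM8K-style problem, re-sample the surface details, and re-score the model on each variant. A rough, illustrative sketch (the names and number ranges here are made up, not the paper's actual templates):

```python
import random

# Illustrative GSM-Symbolic-style perturbation: keep the problem structure,
# re-sample names and numbers, and re-evaluate the model on each variant.
TEMPLATE = (
    "{name} picks {a} kiwis on Friday and {b} kiwis on Saturday. "
    "On Sunday, {name} picks double the number picked on Friday. "
    "How many kiwis does {name} have?"
)

def make_variant(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Oliver", "Sofia", "Liam", "Mia"])  # made-up name pool
    a, b = rng.randint(10, 90), rng.randint(10, 90)        # made-up number ranges
    question = TEMPLATE.format(name=name, a=a, b=b)
    gold = a + b + 2 * a                                   # ground-truth answer
    return question, gold

rng = random.Random(0)
for question, gold in (make_variant(rng) for _ in range(5)):
    print(gold, "|", question)
    # Accuracy is then the fraction of variants the model answers correctly,
    # averaged over many resampled instances.
```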

By @apsec112 - 4 months
This is silly. Humans will also get fewer right answers if you make the question more complex (requiring additional steps), or if you add irrelevant information as a distraction (since on tests, there's usually an assumption that all information given is relevant). As for changing names and numbers, the large effects they saw were all on small (<10B param) open source models; the effects on o1 were tiny and barely distinguishable from noise.
By @mvdtnz - 4 months
Gosh I wish someone would pay me handsomely for coming up with such stupidly obvious "research" results as "a computer program that uses statistics to pick the next word in a sequence doesn't reason like a person".
By @og_kalu - 4 months
They tested o1-preview, but the results are hidden away in the Appendix, probably because o1-preview's "drops" on 4 out of 5 of the new benchmarks are all within the margin of error, i.e. 94.9 on the full GSM8K versus 93.6, 92.7, 95.4, 94 and 77.4 on the modified benchmarks.

The study proves nothing of the sort. Even the results of 4o are enough to give pause to this conclusion.