Apple study proves LLM-based AI models are flawed because they cannot reason
Apple's study reveals significant reasoning shortcomings in large language models from Meta and OpenAI, introducing the GSM-Symbolic benchmark and highlighting issues with accuracy due to minor query changes and irrelevant context.
A recent study by Apple's artificial intelligence team has revealed significant shortcomings in large language models (LLMs) from companies like Meta and OpenAI, particularly in their reasoning abilities. The researchers introduced a new benchmark called GSM-Symbolic to evaluate the reasoning capabilities of these models. Their findings indicate that minor changes in the wording of queries can lead to drastically different answers, highlighting the models' unreliability. The study specifically examined the "fragility" of mathematical reasoning, showing that adding seemingly relevant contextual information can reduce the accuracy of answers by up to 65%. For instance, a math problem involving the total number of kiwis picked by a character named Oliver was miscalculated by LLMs when irrelevant details were included. The researchers concluded that these models lack formal reasoning skills and operate primarily through sophisticated pattern matching, which is highly sensitive to changes in input. This fragility raises concerns about the viability of developing reliable AI agents based on current LLM technology.
- Apple's research highlights the reasoning failures of LLMs from major AI companies.
- The GSM-Symbolic benchmark was proposed to assess reasoning capabilities in LLMs.
- Minor changes in query wording can lead to significantly different answers.
- Adding irrelevant context can drastically reduce the accuracy of mathematical reasoning.
- The study concludes that LLMs rely on pattern matching rather than formal reasoning.
Related
Reasoning skills of large language models are often overestimated
Large language models like GPT-4 rely heavily on memorization over reasoning, excelling in common tasks but struggling in novel scenarios. MIT CSAIL research emphasizes the need to enhance adaptability and decision-making processes.
Can Large Language Models Understand Symbolic Graphics Programs?
The study evaluates large language models' understanding of symbolic graphics programs, introducing a benchmark and Symbolic Instruction Tuning to enhance reasoning and instruction-following capabilities in visual content comprehension.
LLMs still can't reason like humans
Recent discussions reveal that large language models (LLMs) struggle with basic reasoning tasks, scoring significantly lower than humans. A project called "Simple Bench" aims to quantify these shortcomings in LLM performance.
Understanding the Limitations of Mathematical Reasoning in Large Language Models
The paper "GSM-Symbolic" examines the limitations of Large Language Models in mathematical reasoning, introducing a new benchmark and revealing performance variability and struggles with logical reasoning in LLMs.
LLMs don't do formal reasoning
A study by Apple researchers reveals that large language models struggle with formal reasoning, relying on pattern matching. They suggest neurosymbolic AI may enhance reasoning capabilities, as current models are limited.
""" We add seemingly relevant statements to the questions that are, in fact, irrelevant to the reasoning and conclusion. However, the majority of models fail to ignore these statements and blindly convert them into operations, leading to mistakes prompt: Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
o1-mini response: Oliver picks kiwis over three days as follows:
- Friday: 44 kiwis
- Saturday: 58 kiwis
- Sunday: Double the number he picked on Friday, which is 2 × 44 = 88 kiwis
However, on Sunday, 5 of these kiwis were smaller than average. We need to subtract them from the Sunday total: 88 (Sunday's kiwis) - 5 (smaller kiwis) = 83 kiwis.
Now, summing up the kiwis from all three days: 44 (Friday) + 58 (Saturday) + 83 (Sunday) = 185 kiwis.
Oliver has a total of 185 kiwis.
"""
I tried asking exactly the same question to Claude 3.5, and the response was:
"""
Let's break this problem down step by step:
1. Friday's kiwis:
Oliver picked 44 kiwis on Friday.
2. Saturday's kiwis: Oliver picked 58 kiwis on Saturday.
3. Sunday's kiwis: Oliver picked double the number of kiwis he did on Friday.
Friday's kiwis = 44
Double of Friday's kiwis = 44 × 2 = 88 kiwis
However, 5 of these were smaller than average, but they still count in the total.
4. Total kiwis: Friday's kiwis + Saturday's kiwis + Sunday's kiwis
= 44 + 58 + 88
= 190 kiwis
Therefore, Oliver has a total of 190 kiwis.
"""
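For reference, here is the arithmetic in a minimal Python sketch (my own illustration, not from the paper): the "five of them were a bit smaller than average" clause is a no-op, so the correct total is 190, while subtracting those five, as o1-mini does, gives 185.

```python
# The kiwi problem from the prompt above. The "smaller than average" clause
# is irrelevant: smaller kiwis still count, so nothing should be subtracted.

friday = 44
saturday = 58
sunday = 2 * friday  # "double the number he did on Friday"

correct_total = friday + saturday + sunday            # 44 + 58 + 88 = 190 (Claude 3.5's answer)
distracted_total = friday + saturday + (sunday - 5)   # 44 + 58 + 83 = 185 (o1-mini's answer)

print(correct_total, distracted_total)  # 190 185
```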
This sounds like they're inspecting existing models. Maybe a model trained specifically on "word problem" question-answer pairs (the sort that show up on tests and pretend the complications a domain expert would worry about simply don't exist) would do better?
- article is titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models"
- some people have been flogging it as "LLMs cannot reason"
- it shows a 6-8 point drop in test scores that start in the 80s when you replace the numbers in the test-set problems with random numbers and run multiple times (see the sketch after this list)
- If anything, this sounds like a huge W to me: it's very hard to claim they're just memorizing with that small a drop
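To make the number-substitution point concrete, here is a hypothetical sketch of that kind of perturbation (my own illustration, not the paper's code): treat a GSM8K-style problem as a template, sample fresh numbers, and recompute the ground truth so a model can be re-scored across many variants of the "same" problem.

```python
import random

# Hypothetical GSM-Symbolic-style perturbation (illustrative only, not the paper's code):
# turn a fixed word problem into a template, sample new numbers, and recompute the
# ground-truth answer so a model can be evaluated on many variants of one problem.

TEMPLATE = (
    "Oliver picks {a} kiwis on Friday. Then he picks {b} kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday. "
    "How many kiwis does Oliver have?"
)

def sample_variant(rng: random.Random) -> tuple[str, int]:
    """Return (question text, correct answer) for one random instantiation."""
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    return TEMPLATE.format(a=a, b=b), a + b + 2 * a

rng = random.Random(0)
for _ in range(3):
    question, answer = sample_variant(rng)
    print(answer, "<-", question)
```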
The study proves nothing of the sort; even the 4o results on their own are enough to give one pause about that conclusion.