July 12th, 2024

Reasoning skills of large language models are often overestimated

Large language models like GPT-4 rely more on memorization than on reasoning: they excel at common tasks but struggle in novel scenarios. MIT CSAIL research emphasizes the need to improve these models' adaptability and to better understand their decision-making processes.

Large language models (LLMs) like GPT-4 and Claude excel at familiar tasks but struggle in novel scenarios, which suggests they rely on memorization more than on reasoning, according to recent MIT CSAIL research. The study compared default tasks with counterfactual scenarios and found that the models perform well on the common versions but show significant performance drops in unfamiliar ones. Arithmetic, chess, and logical reasoning were among the tasks used to test the models, and the results highlight how limited their generalization to new scenarios is.

The study emphasizes the importance of improving LLMs' adaptability and of understanding their decision-making processes. It also acknowledges the need for more diverse testing environments to uncover additional weaknesses and to improve the interpretability of these models. The research was presented at the North American Chapter of the Association for Computational Linguistics (NAACL) and was supported by various institutions, including the MIT–IBM Watson AI Lab and the National Science Foundation.
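To make the default-versus-counterfactual comparison concrete, here is a minimal Python sketch for the arithmetic case. It is an illustration only, not the study's code: the choice of base 9 as the counterfactual setting, the problem list, and the `memorizer` function (a hypothetical stand-in for a model that always answers as if the numbers were in base 10) are all assumptions made for the example.

```python
def to_base(n: int, base: int) -> str:
    """Render a non-negative integer as a digit string in the given base (2..10)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))


def add_in_base(a: str, b: str, base: int) -> str:
    """Add two digit strings interpreted in `base`; return the sum in that base."""
    return to_base(int(a, base) + int(b, base), base)


def accuracy(answer_fn, problems, base: int) -> float:
    """Fraction of (a, b) problems where answer_fn gives the true sum in `base`."""
    correct = sum(
        answer_fn(a, b, base) == add_in_base(a, b, base) for a, b in problems
    )
    return correct / len(problems)


def memorizer(a: str, b: str, base: int) -> str:
    """Hypothetical 'model' that ignores the base and always adds in base 10."""
    return str(int(a) + int(b))


# Illustrative problems; every digit is valid in both base 9 and base 10.
problems = [("27", "62"), ("14", "35"), ("58", "26"), ("11", "12")]

default_acc = accuracy(memorizer, problems, base=10)        # familiar setting
counterfactual_acc = accuracy(memorizer, problems, base=9)  # unfamiliar setting

print(f"default (base 10):       {default_acc:.2f}")
print(f"counterfactual (base 9): {counterfactual_acc:.2f}")
```

In this toy setup the stand-in answers every base-10 problem correctly, but in base 9 it only gets the one problem that needs no carry. A gap of this kind between the two settings is what the summary describes as a performance drop in unfamiliar situations.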

10 comments
By @armitron - 6 months
The "GPT-4 Can't Reason" paper [1] [2] is excellent and proves without a shadow of a doubt that there is no reasoning taking place. It also raises strong doubts about the efficacy and reliability of multi-agent approaches that rely on LLM-based planning (such as Cradle [3], which is currently on the front page).

When it was previously discussed here, it received a torrent of low-quality comments that can only be described as confirmation bias: commenters tried the examples verbatim (instead of introducing random perturbations to get around model updates) in order to disprove the thesis.

It's sad that there are "engineers" out there so blinded by their own wishful thinking or vested interests that they cannot accept the obvious.

[1] https://arxiv.org/abs/2308.03762

[2] https://medium.com/@konstantine_45825/gpt-4-cant-reason-adde...

[3] https://baai-agents.github.io/Cradle/

By @rco8786 - 6 months
Because there is no "reasoning", right? LLMs create output that appears to have been reasoned about, but this is not at all the reality.
By @alexsereno - 6 months
I’d love to see this placed into the context of the (actual) average Joe. If you consider that the average person dramatically overestimates their reasoning skills and can’t correctly do high school level math, I think it would help us understand whether we have reached AGI. It would also just provide some color on the state of things.
By @ArcaneMoose - 6 months
This is why I find ARC-AGI so fascinating - it really seems like a great benchmark to test for this sort of behavior. Hope that it doesn't get solved through a brute-force memorization approach. I do wonder how reasoning is different from other sorts of 'thinking' if it's all neurons firing in the brain.
By @shermantanktop - 6 months
Chord fingering is mentioned. I’ve asked ChatGPT questions about drop-2 voicings and it explained drop-2 and then gave examples of drop-3. I told it that it was incorrect and it then changed some wording but repeated the incorrect examples.
By @apsec112 - 6 months
FWIW, the version of Claude they use is quite old, v1.3. I'd be curious to see how the new Claude 3.5 performs on these.