July 12th, 2024

Reasoning skills of large language models are often overestimated

Large language models like GPT-4 rely more on memorization than on reasoning, excelling at familiar tasks but struggling in novel scenarios. MIT CSAIL research emphasizes the need to improve their adaptability and to better understand their decision-making processes.

Large language models (LLMs) like GPT-4 and Claude excel at familiar tasks but struggle in novel scenarios, revealing a reliance on memorization over reasoning, according to recent MIT CSAIL research. The study compared default versions of tasks such as arithmetic, chess, and logical reasoning with counterfactual variants of the same tasks, and found that models that perform well on the common versions suffer significant performance drops on the unfamiliar ones. The results suggest that strong performance on well-known tasks does not generalize to diverse situations, and the authors emphasize the importance of improving LLMs' adaptability and understanding their decision-making processes. They also acknowledge the need for more diverse testing environments to uncover additional weaknesses and to improve the interpretability of these models. The research was presented at the North American Chapter of the Association for Computational Linguistics (NAACL) conference and was supported by institutions including the MIT–IBM Watson AI Lab and the National Science Foundation.
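To make the counterfactual setup concrete, here is a minimal sketch of the kind of arithmetic probe the study describes: the same addition questions are posed under the familiar base-10 rules and under a less common base, and accuracy is compared across the two conditions. The query_model function is a hypothetical stand-in for whatever LLM API is being tested, and the prompts and bases are illustrative, not the study's own code.

    # Sketch of a counterfactual arithmetic probe. query_model() is a hypothetical
    # stand-in for an LLM API call; nothing here is taken from the study's code.

    def to_base(n: int, base: int) -> str:
        """Render a non-negative integer as a digit string in the given base (<= 10)."""
        if n == 0:
            return "0"
        digits = []
        while n:
            digits.append("0123456789"[n % base])
            n //= base
        return "".join(reversed(digits))

    def expected_sum(a: str, b: str, base: int) -> str:
        """Ground truth: interpret both operands in `base`, add, render in `base`."""
        return to_base(int(a, base) + int(b, base), base)

    def make_prompt(a: str, b: str, base: int) -> str:
        return (f"You are doing addition in base {base}. "
                f"What is {a} + {b}? Reply with the sum only.")

    def run_probe(pairs, query_model):
        """Compare accuracy in the default condition (base 10) vs. a counterfactual one (base 9)."""
        results = {}
        for base in (10, 9):
            correct = sum(
                query_model(make_prompt(a, b, base)).strip() == expected_sum(a, b, base)
                for a, b in pairs
            )
            results[base] = correct / len(pairs)
        return results

    # Example: run_probe([("27", "62"), ("45", "38")], query_model)
    # A memorization-heavy model would show a large gap between the base-10 and base-9 scores.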

10 comments
By @armitron - 3 months
The "GPT-4 Can't Reason" paper [1] [2] is excellent and proves without a shadow of a doubt that there is no reasoning taking place. It also raises strong doubts about the efficacy and reliability of multi-agent approaches that rely on LLM-based planning (such as Cradle [3], which is currently on the front page).

When it was previously discussed here, it received a torrent of low-quality comments that can only be described as confirmation bias: commenters tried the examples verbatim (instead of introducing random perturbations to get around model updates) in order to disprove the thesis.

It's sad that there are "engineers" out there so blinded by their own (wishful thinking, vested interests) that they cannot accept the obvious.

[1] https://arxiv.org/abs/2308.03762

[2] https://medium.com/@konstantine_45825/gpt-4-cant-reason-adde...

[3] https://baai-agents.github.io/Cradle/

By @rco8786 - 3 months
Because there is no "reasoning", right? LLMs create output that appears to have been reasoned about, but this is not at all reality.
By @alexsereno - 3 months
I’d love to see this placed into the context of the (actual) average Joe. If you consider that the average person dramatically overestimates their reasoning skills and can’t correctly do high-school-level math, I think it would help us understand whether we have reached AGI. It would also just provide some color to the state of things.
By @ArcaneMoose - 3 months
This is why I find ARC-AGI so fascinating - it really seems like a great benchmark to test for this sort of behavior. Hope that it doesn't get solved through a brute-force memorization approach. I do wonder how reasoning is different from other sorts of 'thinking' if it's all neurons firing in the brain.
By @shermantanktop - 3 months
Chord fingering is mentioned. I’ve asked ChatGPT questions about drop-2 voicings and it explained drop-2 and then gave examples of drop-3. I told it that it was incorrect and it then changed some wording but repeated the incorrect examples.
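(For context on the terminology: a drop-2 voicing lowers the second note from the top of a close-position chord by an octave, while drop-3 lowers the third from the top. The sketch below, using an arbitrary Cmaj7 chord in MIDI note numbers, is only meant to make explicit the distinction the model kept confusing.)

    # Illustrative only: drop-2 vs. drop-3 voicings of a Cmaj7 chord (MIDI note numbers).

    def drop_voicing(close_voicing, drop_index):
        """Lower the drop_index-th note from the top of a close voicing by an octave."""
        notes = sorted(close_voicing)
        notes[-drop_index] -= 12  # move that voice down one octave
        return sorted(notes)

    cmaj7_close = [60, 64, 67, 71]       # C E G B in close position
    print(drop_voicing(cmaj7_close, 2))  # drop-2: [55, 60, 64, 71] -> G C E B
    print(drop_voicing(cmaj7_close, 3))  # drop-3: [52, 60, 67, 71] -> E C G B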
By @apsec112 - 3 months
FWIW the version of Claude they use is quite old, v1.3. I'd be curious to see how the new Claude 3.5 performs on these tasks.