October 14th, 2024

LLMs can't perform "genuine logical reasoning," Apple researchers suggest

A study by Apple engineers shows large language models struggle with logical reasoning, exhibiting significant accuracy drops with minor changes to problems, indicating reliance on pattern matching over true understanding.

A recent study by Apple engineers reveals significant limitations in the reasoning capabilities of large language models (LLMs). The research, titled "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models," demonstrates that LLMs struggle with logical inference when faced with minor modifications to benchmark problems. By altering names and numbers in the GSM8K dataset, the researchers found that LLMs exhibited a notable drop in accuracy, indicating that these models rely on probabilistic pattern matching rather than genuine logical reasoning. Performance varied significantly across different runs, with some models experiencing accuracy drops of up to 15%. Furthermore, when irrelevant information was introduced into the questions, accuracy plummeted by as much as 65.7%. This suggests that LLMs do not truly understand the problems they are solving but instead mimic reasoning based on their training data. The findings align with previous research indicating that LLMs lack formal reasoning capabilities and highlight the need for advancements in AI that incorporate true symbol manipulation. Experts argue that future improvements in AI will require models to develop a deeper understanding of logic and abstract concepts, rather than relying solely on pattern recognition.

- Apple study reveals LLMs' reasoning capabilities are fragile and unreliable.

- Minor changes to benchmark problems lead to significant drops in accuracy.

- LLMs rely on pattern matching rather than genuine logical reasoning.

- Introducing irrelevant information can cause catastrophic performance failures.

- Future AI advancements may require true understanding of logic and abstract concepts.
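
To make the GSM-Symbolic idea concrete, here is a minimal sketch in Python of the kind of perturbation the paper describes: the logical structure of a GSM8K-style problem stays fixed while names and numbers are resampled, and a GSM-NoOp-style distractor sentence is appended that should not change the answer. The template, names, and number ranges below are illustrative, not the paper's actual generator.

    import random

    # Illustrative GSM8K-style template: the reasoning structure is fixed,
    # only the surface details (name, numbers) are resampled.
    TEMPLATE = (
        "{name} picks {fri} kiwis on Friday and {sat} kiwis on Saturday. "
        "On Sunday, {name} picks double the number picked on Friday. "
        "How many kiwis does {name} have?"
    )

    NAMES = ["Oliver", "Sofia", "Liam", "Mei"]

    # A GSM-NoOp-style distractor: extra detail that should not change the answer.
    DISTRACTOR = " Five of them were a bit smaller than average."

    def make_variant(seed: int) -> tuple[str, int]:
        """Return one perturbed problem and its ground-truth answer."""
        rng = random.Random(seed)
        fri, sat = rng.randint(20, 60), rng.randint(20, 60)
        question = TEMPLATE.format(name=rng.choice(NAMES), fri=fri, sat=sat)
        answer = fri + sat + 2 * fri  # the required reasoning never changes
        return question, answer

    if __name__ == "__main__":
        q, a = make_variant(seed=0)
        print(q, "->", a)
        print(q + DISTRACTOR, "->", a)  # correct answer is unchanged

If a model's accuracy swings across such variants, or drops when the distractor is appended, that is the fragility the study measures.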

12 comments
By @rahimnathwani - 4 months
Discussed the day before yesterday: https://news.ycombinator.com/item?id=41823822

And the day before that: https://news.ycombinator.com/item?id=41808683

By @wkat4242 - 4 months
LLMs were never designed for this. In Apple's language: "you're holding it wrong".

It's an impressive technology, but its limits are largely overlooked in the current hype cycle.

AI researchers have known this from the start and won't be surprised, because LLMs were never intended to be able to do this.

The problem is the customers who are impressed by the human-sounding bot (sounding human is exactly what an LLM is for), mentally ascribe human skills and thought processes to it, and start using it for things it isn't: an oracle of knowledge, a reasoning engine or a mathematics expert.

If you want knowledge, go to a search engine (a good one like Kagi), which can be AI-assisted, like Perplexity. If you want maths, go to Wolfram Alpha. For real reasoning we need a few more steps on the road to general AI.

This is the problem with hype cycles. People think a tech is the be-all and end-all for everything and stop regarding its limitations. The metaverse hype saw the same problem, even though there are some niche use cases where it really shines.

But now it's labelled a flop because the overblown expectations of all the overhyped investors couldn't be met.

What an LLM is great at is the human interaction part. But it needs to be backed by other types of AI that can actually handle the request, and for many use cases that tech still needs to be invented. What we have here is a toy dashboard that looks like that of a real car, except it's not connected to one. The rest will come, but it'll take a lot more time. Meanwhile, making LLMs smarter will not really solve the problem that they're inherently not the tool for the job they're being used for.

By @gota - 4 months
This seems to be a comprehensive repeat of the "Rot13" and "Mystery Blocks World" experiments as described by Prof. Subbarao Kambhampati.

Rot13 meaning that LLMs can't do Rot-3, Rot-4, ..., Rot-n, except for Rot13 (because that's in the training data).

Mystery Blocks World being a trivial "translation" (by direct replacement of terms) of a simple Blocks World problem. The LLMs can solve the original, but not the "translation" - surprisingly, even when provided with the term replacements!

Both are discussed in Prof. Subbarao's Machine Learning Street Talk episode.
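
On the Rot-N point above: the cipher is a single parameterized procedure, and Rot13 is just the n = 13 case, so a model that actually executes the procedure should handle any n equally well. A minimal sketch in Python (my own illustration, not taken from the talk):

    def rot_n(text: str, n: int) -> str:
        """Shift each ASCII letter by n positions, wrapping within the alphabet."""
        out = []
        for ch in text:
            if ch.isascii() and ch.isalpha():
                base = ord("a") if ch.islower() else ord("A")
                out.append(chr(base + (ord(ch) - base + n) % 26))
            else:
                out.append(ch)
        return "".join(out)

    assert rot_n("Hello", 13) == "Uryyb"                # the case LLMs tend to get right
    assert rot_n(rot_n("Hello", 4), 26 - 4) == "Hello"  # Rot-4 round-trips just as mechanically

The procedure is identical for every n, which is why success only at n = 13 points to memorization rather than execution.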

By @TexanFeller - 4 months
A couple years ago I heard an interview with someone that has stuck with me. He said that human “reasoning” didn’t evolve to reason in the logical sense, but to _provide reasons_ likely to be accepted by other humans, allowing better survival by manipulating other humans. This matches my perception of most people’s reasoning.

What’s funny is that AI is now being trained by a human accepting or rejecting its answers, probably not on the basis of the rigor of the answer since the temp worker hired to do it is probably not a logician, mathematician, or scientist. I suspect most people’s reasoning is closer to an LLM’s than we would be comfortable admitting.

By @airstrike - 4 months
> OpenAI's ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic.

In other words, ChatGPT continues to dominate. A 0.3% drop might as well be noise.

Also the original, allegedly more expensive GPT-4 (can we call it ChatGPT-4og??) is conspicuously missing from the report...

By @osigurdson - 4 months
Companies have been salivating at the possibility of firing everybody and paying ChatGPT $20 per month instead to run the entire business. I don't have any moral objections to it but find it incredibly naive. ChatGPT / LLMs help a bit - that's it.
By @randcraw - 4 months
One thing I like about this effort is their attempt to factor out the caching of prior answers due to having been asked a similar question before. Given the nearly eidetic memoization ability of LLMs, no cognitive benchmark can be meaningful unless the LLM's question history can somehow be voided after each query. I think this is especially true when measuring reasoning, which will surely benefit greatly from caching answers from earlier questions into a working set that enhances its associations on future similar questions -- which only looks like reasoning.
By @cyanydeez - 4 months
The best thing LLMs do is add to the theory of p-zombies among the population.

Instead of the dead Internet theory, we should start finding out what percentage of the population is no better than an LLM.

By @jokoon - 4 months
Finally, some people are using basic cognitive science to evaluate AI.

Also, they mapped an insect brain.

Seems like my several comments suggesting AI scientists should peek at other fields did get some attention.

That probably makes me the most talented and insightful AI scientist on the planet.

By @bubble12345 - 4 months
I mean, so far LLMs can't even do addition and multiplication of integers accurately, so we can't really expect too much in terms of logical reasoning.
By @krick - 4 months
First off, I want to say it's kinda baffling to me that this is some kind of novel "research", and that it's published by Apple, of all companies in the field. I could be more forgiving of the journalists who try to sell it as "look, LLMs are incapable of logical reasoning!", because journalists always shout loud, stupid stuff, otherwise they don't get paid, apparently. But still, it's kind of hard to justify the nature of this "advancement".

I mean, what is being described seems like a super basic debug step for any real-world system. This is the kind of stuff that not-very-advanced QA teams in boring banks do to test your super-boring, not-very-advanced back-office bookkeeping systems. After this kind of testing reveals a number of bugs, you don't erase the bookkeeping system and conclude that banking should be done manually on paper only, since computers are obviously incapable of making correct decisions; you fix these problems one by one, which sometimes means not just fixing a software bug but revising the whole business logic of the process. But this is, you know, routine.

So, not being aware of what these benchmarks are that everyone uses to test LLM products (please note, they are not testing LLMs as some kind of concept here, they are testing products), I would assume that OpenAI in particular, and any major company that has released its own LLM product in the last couple of years in general, already does this super-obvious thing. But why does this huge discovery happen only now, then?

Well, obviously, there are two possibilities. Either none of them really does this, which sounds unbelievable - what do all these highly paid genius researchers even do, then? Or, more plausibly, they do, but don't publish it. This one sounds reasonable, given that there's no OpenAI, only AltmanAI, and all that stuff. Like, they compete to make a better general reasoning system; of course they don't want to reveal all their research.

But this doesn't really look reasonable to me (at least, at this very moment) given how basic the problem being discussed is. I mean, every school kid knows you shouldn't test on the data you used for training, so "peeking at the answers while writing the test" just to make your product perform slightly better on popular benchmarks seems super cheap. I can understand when Qualcomm tweaks processors specifically to beat AnTuTu, but trying to beat problem-solving by improving your crawler to grab all the tests on the internet is pointless. It seems they should actively try not to contaminate their training step with popular benchmarks. So what's going on? Are the people working on these systems really that uncreative?

That said, all of this only applies to the general approach, which is to say it's about what the article claims, not what it shows. I personally am not convinced.

Let's take the kiwi example. The whole argument is framed as if it's obvious that the model shouldn't have subtracted those 5 kiwis. I don't know about that. Let's imagine this is a real test, taken by real kids. I guarantee you that most (all?) of them would be rather confused by the wording. Like, what should we do with this information? Why was it included? Then they will decide whether they should or shouldn't subtract the 5. I won't try to guess how many of them will, but the important thing is, they'll have to make this decision, and (hopefully) nobody will suddenly multiply the answer by 5 or do some meaningless shit like that.

And neither did the LLMs in question, apparently.
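
(For reference, the arithmetic at stake is tiny; the numbers below are the ones widely quoted from the paper's GSM-NoOp kiwi example and are assumed here rather than checked against it.)

    # Kiwi problem arithmetic, numbers as widely quoted from the GSM-NoOp example (assumed).
    friday, saturday = 44, 58
    sunday = 2 * friday          # "double the number he picked on Friday"
    smaller = 5                  # the irrelevant clause: five kiwis were a bit smaller

    literal_answer = friday + saturday + sunday      # 190: size doesn't change the count
    distracted_answer = literal_answer - smaller     # 185: the subtraction being debated
    print(literal_answer, distracted_answer)         # 190 185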

In the end, these students will get the wrong answer, sure. But who decides that it's wrong? Well, of course, the teacher does. Why is it wrong? Well, "because it wasn't said that you should discard the small kiwis!" Great, man, but you also didn't tell us we shouldn't discard them. This isn't a formal algebra problem; we are trying to use some common sense here.

In the end, it doesn't really matter what the teacher thinks the correct answer is, because it was just a stupid test. You may never really agree with him on this one, and it won't affect your life. You'll probably end up making more than him anyway, so there's your consolation.

So framing situations like this as proof that the LLM gets things objectively wrong just isn't right. It got it subjectively wrong, judged by the opinion of the Apple researchers in question and some other folks. Of course, this is what LLM development essentially is: doing whatever magic you deem necessary to get it to give more subjectively correct answers. And this returns to my first point: what is OpenAI's (Anthropic's, Meta's, etc.) subjectively correct answer here? What is the end goal anyway? Why does this "research" come from "Apple researchers" and not from one of these companies' tech blogs?

By @fungiblecog - 4 months
No shit!