October 11th, 2024

LLMs don't do formal reasoning

A study by Apple researchers finds that large language models struggle with formal reasoning and rely on pattern matching instead. The authors suggest neurosymbolic AI may be needed to improve reasoning, given the limits of current models.

A recent study by a team of AI researchers at Apple highlights a significant limitation of large language models (LLMs): their inability to perform formal reasoning. The researchers found that LLMs operate primarily through sophisticated pattern matching rather than genuine reasoning, with their performance being sensitive to minor changes in input. This fragility was illustrated through a new task called GSM-NoOp, which demonstrated that LLMs struggle with reasoning when faced with distracting information. Previous studies have shown similar results, indicating that LLMs perform adequately on small problems but falter as complexity increases. This trend is evident in tasks like integer arithmetic and even in games like chess, where LLMs fail to adhere to established rules. The authors argue that the current neural network architectures lack the capability for reliable extrapolation and formal reasoning, suggesting that a combination of neural networks with symbolic reasoning—termed neurosymbolic AI—may be essential for future advancements. Gary Marcus, a prominent figure in AI research, emphasizes the need for alternative strategies to address these shortcomings, as the existing models have not yet demonstrated the ability to reason abstractly or manipulate symbols effectively.

- A study from Apple reveals LLMs lack formal reasoning capabilities.

- LLMs primarily rely on pattern matching, making them fragile to input changes.

- Performance declines significantly as problem complexity increases.

- Neurosymbolic AI may be necessary for improving reasoning in AI models.

- Gary Marcus advocates for alternative research strategies to overcome current limitations.

36 comments
By @dang - 4 months
Related ongoing thread:

Understanding the Limitations of Mathematical Reasoning in LLMs - https://news.ycombinator.com/item?id=41808683 - Oct 2024 (127 comments)

By @zero_k - 4 months
Yes, we should use LLMs to translate human requirements that are ambiguous and full of hidden assumptions (e.g. that football matches should preferably be scheduled when people are awake and not working [3]) into formal requirements, e.g. by generating SMT [1] or ASP [2] queries. Then a formal methods tool, e.g. Z3/cvc5 or clingo, can solve these now-formal queries, and the LLM can translate the solution back into human language (a rough sketch of that middle step is below the footnotes). This does not solve every problem, e.g. the LLM not correctly guessing the implicit requirements, but it does get around a bunch of issues.

We do need to pump up the jam when it comes to formal methods tools, though. Academia is still rife with quantum and AI buzzword generators if you want to get funding, and formal methods doesn't get enough of it. Amazon has put a bunch of money into the field (hiring all the good talent :sadface:), and Microsoft is funding both Z3 and Lean4. Industry is ahead of the game, again. This is purely a failure of academic leadership, nothing else.

[1] https://en.wikipedia.org/wiki/Satisfiability_modulo_theories

[2] https://en.wikipedia.org/wiki/Answer_set_programming

[3] Anecdotal, but this was a "bug" in a solution offered by a tool that optimally schedules football matches in Spain.
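
For concreteness, here is a minimal sketch of the middle step of that pipeline, assuming the LLM has already turned the fuzzy requirement into explicit constraints; the working/awake hours below are invented for the example.

```python
# Sketch: solve LLM-generated scheduling constraints with Z3.
# Requires `pip install z3-solver`. The constraint values are assumptions.
from z3 import Int, Solver, Or, And, sat

kickoff = Int("kickoff_hour")  # hour of day, 0-23

s = Solver()
s.add(kickoff >= 0, kickoff <= 23)
s.add(Or(kickoff < 9, kickoff >= 17))    # not during working hours (assumed 9-17)
s.add(And(kickoff >= 8, kickoff <= 23))  # people are awake (assumed 8-23)

if s.check() == sat:
    # An LLM could now translate this model back into plain language.
    print("Feasible kickoff hour:", s.model()[kickoff])
else:
    print("No feasible kickoff time under these constraints.")
```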

By @tj-teej - 4 months
If anyone is curious, a Meta data scientist published a great piece about what LLMs are actually doing (and therefore are able to do), and how that gets papered over by wrapping them in chat bots. It's a long but very engaging read.

https://medium.com/@colin.fraser/who-are-we-talking-to-when-...

By @wkat4242 - 4 months
One of the things that kind of illustrates this for me is that an LLM always takes the same time to process a prompt of a given length, no matter how complicated the problem is. The complexity of the problem is obviously not taken into account.
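
A back-of-the-envelope sketch of that point: the forward-pass cost of a decoder-only transformer depends on parameter and token counts, not on how hard the question is (the model size here is an arbitrary assumption).

```python
# Rough rule of thumb: forward-pass FLOPs ~ 2 x parameters x tokens processed.
# The difficulty of the question never enters the formula.
PARAMS = 70e9  # assumed model size, purely illustrative

def forward_flops(prompt_tokens: int, output_tokens: int) -> float:
    return 2 * PARAMS * (prompt_tokens + output_tokens)

print(forward_flops(50, 20))  # a 50-token trivial question
print(forward_flops(50, 20))  # a 50-token open research problem: same cost
```
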
By @anon291 - 4 months
Current LLMs are one-shot. They are forced to produce an output without thinking, leading to the preponderance of hallucinations and lack of formal reasoning. Human formal reasoning is not instinctual. Unlike 'aha!' moments, it requires us to think. Part of that thinking process is turning our attention inwards into our own mind and using symbolic manipulations that we do not utter in order to 'think'.

LLMs broadly are capable of this, but we force them to not do it by forcing the next token to be the final output.

The human equivalent would be solving a problem and showing all your steps, including steps that were wrong but that you took anyway. That's why chain-of-thought works.

The 'fix' is to allow LLMs to pause, generate tokens that are not transliterated into text, and then signal when they want to unpause. Training such a system is left as an exercise for the reader, although there have been attempts.
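
A toy sketch of what that pause could look like at decode time: the model is allowed to emit scratchpad tokens between made-up <think>/</think> sentinels, and only the rest is shown to the user. The model interface here is assumed, not any real API.

```python
# Toy decode loop: tokens emitted between the hypothetical <think> and </think>
# sentinels are kept as hidden scratchpad and never shown to the user.
def generate_with_scratchpad(next_token, prompt_tokens, max_new_tokens=256):
    context = list(prompt_tokens)
    visible, thinking = [], False
    for _ in range(max_new_tokens):
        tok = next_token(context)   # assumed interface: returns one token string
        context.append(tok)
        if tok == "<think>":
            thinking = True
        elif tok == "</think>":
            thinking = False
        elif tok == "<eos>":
            break
        elif not thinking:
            visible.append(tok)
    return " ".join(visible)

# Tiny fake model so the sketch runs end to end.
script = iter(["<think>", "44", "+", "58", "+", "88", "</think>", "190", "<eos>"])
print(generate_with_scratchpad(lambda ctx: next(script), ["How", "many", "kiwis", "?"]))
```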

By @rahimnathwani - 4 months
The paper (published 4 days ago) has this on page 10, and says that o1-mini failed to solve it correctly:

   Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
I pasted it into ChatGPT and Claude, and all four models I tried gave the correct answer:

4o mini: https://chatgpt.com/share/6709814f-9ff8-800e-8aab-127b6f952d...

4o: https://chatgpt.com/share/6709816c-3768-800e-9eb1-173dfbb5d8...

o1-mini: https://chatgpt.com/share/67098178-4088-800e-ba95-9731a75055...

3.5 sonnet: https://gist.github.com/rahimnathwani/34f93de07eb7510d57ec1e...

By @whiplash451 - 4 months
I am not sure who the target audience of Gary Marcus is.

Those who know about LLMs are aware that they do not reason, but also know it's not very useful to repeat that over and over, and focus on other aspects of research.

Those who don't know about LLMs simply learn to use them in a way that's useful in their life.

By @procgen - 4 months
ChatGPT o1-preview was not flummoxed by the small kiwis, and even called out the extraneous detail:

```
To determine the total number of kiwis Oliver has, we’ll sum up the kiwis he picked on each day:

1. Friday: Oliver picks 44 kiwis.
2. Saturday: He picks 58 kiwis.
3. Sunday: He picks double the number he did on Friday, so 2 × 44 = 88 kiwis.

Adding them up:

44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis

The mention of five smaller-than-average kiwis on Sunday doesn’t affect the total count unless specified otherwise.

Answer: 190
```

By @raytopia - 4 months
Tangential but I do wonder how much it's actually going to cost to use these systems once the investor money gets turned off and they want a return on investment. Given that the systems are only getting bigger it can't be cheap.
By @Syzygies - 4 months
ChatGPT-4o:

44+58+88=190

So, Oliver has a total of 190 kiwis. The five smaller kiwis on Sunday are still included in the total count, so they don't change the final sum.

By @aiono - 4 months
So it means that throwing more data and compute at LLMs won't make them intelligent, as opposed to what many people claim will happen. They also don't handle situations that aren't in the training data, since they are just extrapolating from what they were trained on without any real understanding.
By @kayvr - 4 months
Integer multiplication was used to test LLMs' reasoning capabilities, and I think Karpathy mentioned that tokenization might play a role in basic math. MathGLM was compared against GPT-4 in the article, but I couldn't figure out whether MathGLM was trained with character-level tokenization or not.
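
To see the tokenization point concretely, here is a small check with the `tiktoken` library and the GPT-4-era `cl100k_base` encoding; the exact splits depend on the encoding, so treat the output as illustrative.

```python
# How a BPE tokenizer chops up numbers: multi-digit integers are often split
# into irregular chunks rather than single digits, which makes digit-by-digit
# arithmetic harder to learn. Requires `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["12345678", "987654321 * 123456789"]:
    pieces = [enc.decode([t]) for t in enc.encode(text)]
    print(f"{text!r} -> {pieces}")
```
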
By @jumploops - 4 months
> There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer.

LLMs are not magic bullets for every problem, but that doesn't preclude them from being used to build reliable systems or "agents."

It's clear that we don't yet have the all-encompassing AGI architecture, especially with the transformer model alone, but adding steps beyond the transformer leads to interesting results, as we've seen with current coding tools and the new o1-series models by OpenAI.

For example, the featured article calls out `o1-mini` as failing a kiwi-counting test prompt; however, the `o1-preview` model gets the right answer[0].

I also built a simple test using gpt-4o that prompts it to solve the problem in parts, and it reliably returns the correct answer using only gpt-4o and code generated by gpt-4o[1]; a rough sketch of the idea is below.
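
This is not the linked loop's actual code, just a minimal sketch of the same idea with the `openai` Python client: have the model write plain Python that computes the answer, then run that code yourself. The model name and prompt wording are assumptions.

```python
# Sketch: let gpt-4o generate the arithmetic as code instead of doing it in
# its head, then execute the generated code locally (sandbox it in real use).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

problem = (
    "Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday's "
    "count on Sunday, but five of them were a bit smaller than average. "
    "How many kiwis does Oliver have?"
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Reply with only Python code (no prose, no backticks) that "
                   "prints the numeric answer to this problem:\n" + problem,
    }],
)

code = resp.choices[0].message.content
print(code)  # inspect, then run in a sandbox rather than calling exec() directly
```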

Furthermore, there's still a ton of research being done on models that are specific to formal theorem proving that show promise[2] (even if `o1-preview` already beats them for e.g. IMO problems[3]).

I'm of the opinion that we still have a ways to go until AGI, but that doesn't mean LLMs can't be used in reliable ways.

[0]https://chatgpt.com/share/e/67098356-ce88-8001-a2e1-9857064a...

[1]https://magicloops.dev/loop/30fb3c1a-8e40-47ae-8611-91554faf...

[2]https://arxiv.org/pdf/2408.08152

[3]https://openai.com/index/introducing-openai-o1-preview/

By @Max-q - 4 months
The mistakes they make are very similar to the mistakes we humans make. Just as you can confuse a human with irrelevant information, you can distract an LLM. We are not good at big tables of integers either.

An LLM isn't a calculator. But we probably can teach it how to use one.
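
A minimal sketch of "teaching it to use one" via function calling with the `openai` client; the tool name and schema here are invented for the example.

```python
# Sketch: expose a calculator tool so the model can delegate arithmetic
# instead of guessing digits. Tool name and schema are made up for illustration.
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 44 + 58 + 2 * 44?"}],
    tools=tools,
)

# In practice, check whether the model chose to call the tool at all.
call = resp.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
print("Model asked the calculator to evaluate:", args["expression"])
```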

By @diablozzq - 4 months
At one point the goalpost was the Turing test. That's long since been passed, and we aren't satisfied.

Then the goalposts were moved to logical reasoning, such as Winograd schemas. Then that wasn't enough.

In fact, it’s abundantly clear we won’t be satisfied until we’ve completely destroyed human intelligence as superior.

The current goal post is LLMs must do everything better than humans or it’s not AGI. If there is one thing it does worse, people will cite it as just a stochastic parrot. That’s a complete fallacy.

Of course we dare not compare LLMs to the worse case human - because LLMs would be AGI compared to that.

We compare LLMs to the best human in every category - unfairly.

With LLMs it’s been abundantly clear - there is not a line where something is intelligent or not. There’s only shades of gray and eventually we call it black.

There will always be differences between LLM capabilities and humans - different architectures and different training. However, it's very clear that processes that take in huge amounts of data and process it, whether a brain or an LLM, come up with similar results.

Someone should come up with a definition of intelligence that excludes all LLMs and includes all humans.

Also, while you're at it, disprove that humans do more than what ChatGPT does - a.k.a. probabilistic word generation.

I’ll wait.

Until then, as ChatGPT blows past what was science fiction 5 years ago, maybe these arguments aren’t great?

Also - name one task we have the data for where we haven't been able to produce a neural network capable of performing it.

Human bodies have so many sensors it's mind blowing. The data any human processes in one day simply blows LLMs out of the water.

Touch, taste, smell, hearing, etc…

That's not to say that, if you could hook up a hypothetical neural network to a human body, it couldn't do the same.

By @samweb3 - 4 months
But... when I test the example, I get the right answer from 4o. Seems like they can just extend the model to identify irrelevant information over time and get the correct results more generally for similar models. The 4o output:

Let's break it down:

- Friday: Oliver picks 44 kiwis.
- Saturday: Oliver picks 58 kiwis.
- Sunday: He picks double the amount he picked on Friday: 44 × 2 = 88 kiwis.

Now, we sum all the kiwis: 44 + 58 + 88 = 190. Since the size of five kiwis on Sunday doesn’t affect the total count, Oliver still has 190 kiwis.

By @andrewla - 4 months
Like many other commenters, I was unable to reproduce the behavior cited in the link. I do like that this is attempting to make explicit the specific form of "formal reasoning" that is being used here, even if I do not necessarily agree that we have a clean separation between the ideas of "pattern matching" and "formal reasoning", or even any real evidence that humans are capable of one and not the other.

The idea that "LLMs have difficulty ignoring extraneous and irrelevant information" is not really dispositive to their effectiveness, since this statement obviously applies to humans as well.

By @YuukiRey - 4 months
I wanted to try this with ChatGPT:

I buy 102 mandarins on Monday and then on Tuesday I buy another 48. On Wednesday I buy 98 apples. I didn't like the last 3 mandarins I bought. How many mandarins do I have?

You bought 102 mandarins on Monday and 48 more on Tuesday, which gives a total of:

102 + 48 = 150 mandarins.

Since you didn't like the last 3 mandarins, you subtract them:

150 - 3 = 147 mandarins.

So, you have 147 mandarins.

By @lostmsu - 4 months
Neither does Gary Marcus. I'd watch him try to determine the truthfulness of the following expression: `!!!!...!!true`, where the number of exclamation marks is chosen at random between 100500 and 500100, without using any external tools.

This would only be an argument against LLMs reasoning if you concede from the above that humans also don't do formal reasoning.
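
For what it's worth, the expression reduces to a parity check on the number of bangs; a tiny sketch:

```python
import random

# !!...!!true reduces to a parity check: each '!' flips the boolean once.
n_bangs = random.randint(100500, 500100)
print(n_bangs, "bangs ->", n_bangs % 2 == 0)  # an even count of '!' leaves True
```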

By @renewiltord - 4 months
Tool use is the measure of intelligence. Terence Tao can use this tool for mathematics.

When Google came out, search engines were suddenly more useful. But there were a bunch of people talking about how “Not everything they find is right” and how “that is a huge problem”.

Then for two decades, people used search highly successfully. Fascinating thing. Tool use.

By @siscia - 4 months
In general I agree, but it is also true that we are not using LLMs as well as we should.

The example in the article: https://chatgpt.com/share/6709a02d-b7cc-800c-882b-430bf019a0...

By @stuaxo - 4 months
Yep, anyone that uses these a bunch should concur.
By @resters - 4 months
I'm working on this. Abstract:

This paper presents a novel framework for multi-stream tokenization, which extends traditional NLP tokenization by generating simultaneous, multi-layered token representations that integrate subword embeddings, logical forms, referent tracking, scope management, and world distinctions. Unlike conventional language models that tokenize based solely on surface linguistic features (e.g., subword units) and infer relationships through deep contextual embeddings, our system outputs a rich, structured token stream. These streams include logical expressions (e.g., `∃x (John(x) ∧ Loves(x, Mary))`), referent identifiers (`ref_1`, `ref_2`), and world scopes (`world_1`, `world_2`) in parallel, enabling precise handling of referential continuity, modal logic, temporal reasoning, and ambiguity resolution across multiple passages and genres, including mathematical texts, legal documents, and natural language narratives.

This approach leverages symbolic logic and neural embeddings in a hybrid architecture, enhancing the model’s capacity for reasoning and referential disambiguation in contexts where linguistic and logical complexity intertwine. For instance, tokens for modal logic are generated concurrently with referential tokens, allowing expressions such as "If John had gone to the store, Mary would have stayed home" to be dynamically represented across possible worlds (`world_1`, `world_2`) with embedded logical dependencies (`If(Go(John, Store), Stay(Mary, Home))`).

We explore how each token stream (e.g., subword, referent, logical, scope, world) interacts in real time within a transformer-based architecture, employing distinct embedding spaces for each type. The referent space (`ref_n`) facilitates consistent entity tracking, even across ambiguous or coreferential contexts, while scope spaces (`scope_n`) manage logical boundaries such as conditional or nested clauses. Additionally, ambiguity tokens (`AMBIGUOUS(A,B)`) are introduced to capture multiple possible meanings, ensuring that referents like "bank" (financial institution or riverbank) can be resolved as more context is processed.

By extending the capabilities of existing neuro-symbolic models (e.g., Neural Theorem Provers and Hybrid NLP Systems) and integrating them with modern transformer architectures (Vaswani et al., 2017), this system addresses key limitations in current models, particularly in their handling of complex logical structures and referent disambiguation. This work sets the foundation for a new class of multi-dimensional language models that are capable of performing logical reasoning and context-sensitive disambiguation across diverse textual domains, opening new avenues for NLP applications in fields like law, mathematics, and advanced AI reasoning systems.
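
This is not the abstract's implementation; purely as a reading aid, here is a sketch of what a single multi-stream token record might look like, with every field name invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative only: one possible shape for a single multi-stream token.
# Field names and example values are invented; they are not from the system above.
@dataclass
class MultiStreamToken:
    subword: str                                 # surface subword unit, e.g. "John"
    logical_form: Optional[str] = None           # e.g. "∃x (John(x) ∧ Loves(x, Mary))"
    referent_id: Optional[str] = None            # e.g. "ref_1"
    scope_id: Optional[str] = None               # e.g. "scope_1" for a nested clause
    world_id: Optional[str] = None               # e.g. "world_1" vs. counterfactual "world_2"
    ambiguity: Optional[Tuple[str, str]] = None  # e.g. ("bank/finance", "bank/river")

tok = MultiStreamToken(
    subword="John",
    logical_form="∃x (John(x) ∧ Loves(x, Mary))",
    referent_id="ref_1",
    scope_id="scope_1",
    world_id="world_1",
)
print(tok)
```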

By @major4x - 4 months
What an obvious article. But because it comes from Apple, everybody pays attention. Proof by pedigree. OK, here are my two cents. Firstly, I did my Ph.D. in AI (algorithm design with application to AI) and I also spent seven years applying some of the ideas at Xerox PARC (yes, the same (in)famous research lab). So I went to and published at many AI conferences (AAAI, ECAI, etc.). Of course, when I was younger and less cynical, I would enter into lengthy philosophical discussions with dignitaries of AI about what AI means, over long dinners and drinks and wheelbarrows of ego. Long story short, there is no such thing as AI. It is a collection of disciplines: the recently famous machine learning (transformers trained on large corpora of text), constraint-based reasoning, Boolean satisfiability, theorem proving, probabilistic reasoning, etc., etc. Of course, LLMs are a great achievement and they have good applications in natural language processing (also an intermingled discipline and considered a constituent of AI).

Look at the algorithmic tools used in ML and in automated theorem proving, for example: ML uses gradient descent (and related numerical methods) for local optimization, while constraint satisfaction/optimization, Boolean satisfiability, satisfiability modulo theories, quantified Boolean optimization, etc. rely on combinatorial optimization. Mathematically, combinatorial optimization is far more problematic than numerical methods and much more difficult, partly because modern computers and NVidia gaming cards are really fast at crunching floating-point numbers, and partly because most problems in combinatorial optimization are NP-hard or harder.

Now think of what an LLM and local optimization are doing: essentially searching over and combining sequences of words from Wikipedia and books. But search is not necessarily a difficult problem; it is actually an O(1) problem. Multiplying numbers is an O(n^2.8) problem (or whatever constant they came up with), while factorization is God knows what complexity class once you take quantum computing into the game.

Great, these are my 2 cents for the day, good luck to the OpenAI investors (I am also investing there a bit as a Bay Area citizen). You guys will certainly make help desk support cheaper...

By @Dig1t - 4 months
You could substitute "LLMs" -> "Humans" and the statement would also be true.
By @bartread - 4 months
This trope of proclaiming some critical flaw in the functioning of LLMs with the implication that they therefore should not be used is getting boring.

LLMs are far from perfect but they can be a very useful tool that, used well, can add significant value in spite of their flaws. Large numbers of people and businesses are extracting huge value from the use of LLMs every single day. Some people are building what will become wildly successful businesses around LLM technology.

Yet in the face of this we still see a population of naysayers who appear intent on rubbishing LLMs at any cost. To me that seems like a pretty bad faith dialogue.

I’m aware that a lot of the positive rhetoric, particularly early on after the first public release of ChatGPT was overstated - sometimes heavily so - but taking one set of shitty arguments and rhetoric and responding to it with polar opposite, but equally shitty, arguments and rhetoric for the most part only serves to double the quantity of shitty arguments and rhetoric (and, adding insult to injury, often does so in the name of “balance”).

By @nisten - 4 months
Is hackernews..hacked?
By @sockaddr - 4 months
No shit, Gary.

Every time I see this guy pop up, it's some bad take or an argument with someone. What's the deal with him?

By @hackinthebochs - 4 months
Getting tired of seeing this guy's bad arguments get signal boosted. I posted this comment on another LLM thread on the front page today, and I'll just repost it here:

LLMs aren't totally out of scope of mathematical reasoning. LLMs roughly do two things: move data around, and recognize patterns. Reasoning leans heavily on moving data around according to context-sensitive rules. This is well within the scope of LLMs. The problem is that general problem solving requires potentially arbitrary amounts of data movement, but current LLM architectures have a fixed number of translation/rewrite steps they can perform before they must produce output. This means most complex reasoning problems are out of bounds for LLMs, so they learn to lean heavily on pattern matching. But this isn't an intrinsic limitation of LLMs as a class of computing device, just a limit of current architectures.

By @levocardia - 4 months
Well, it's about time we moved the goalposts again. All this business about trophies fitting in suitcases being the gold standard was getting pretty embarrassing.
By @ajross - 4 months
It seems like the needle is now swinging too far back, pointing to "LLMs will NEVER work". And I don't think that's very grounded either.

All these criticisms are valid for human beings too. That kind of question trickery trips up school kids all the time. It's hard to use our brains to reason. It takes practice, and the representation of the "reasoning" always ends up being alien to our actual cognitive experience. We literally have invented whole paradigms for writing this stuff down so that it can be communicated to our peers.

So yeah, LLMs aren't ever going to be "better" than humans at reasoning, necessarily, simply because we both suck at it. But they'll improve, likely via a bunch of analogs to human education. "Here's how to teach an LLM to write a formal proof" just hasn't been figured out yet.

By @serjester - 4 months
They're arguing that since it's not close to perfect, it's not useful? Seems like a straw man.
By @nuancebydefault - 4 months
The thing is, from a written, human-readable text there is no single formal reading. The text itself is not formal. The fact that some kiwis are bigger or smaller might seem irrelevant to counting the number of kiwis, but no formal proof of that is possible. I might argue that counting could include volume or weight; you might argue that one kiwi is one kiwi. So saying that LLMs don't do formal reasoning isn't saying anything, because it doesn't mean anything when you start from written sentences.

My point being: LLMs are capable of reasoning, and formal reasoning is meaningless in this context.