LLMs don't do formal reasoning
A study by Apple researchers reveals that large language models struggle with formal reasoning, relying on pattern matching. They suggest neurosymbolic AI may enhance reasoning capabilities, as current models are limited.
A recent study by a team of AI researchers at Apple highlights a significant limitation of large language models (LLMs): their inability to perform formal reasoning. The researchers found that LLMs operate primarily through sophisticated pattern matching rather than genuine reasoning, with their performance being sensitive to minor changes in input. This fragility was illustrated through a new task called GSM-NoOp, which demonstrated that LLMs struggle with reasoning when faced with distracting information. Previous studies have shown similar results, indicating that LLMs perform adequately on small problems but falter as complexity increases. This trend is evident in tasks like integer arithmetic and even in games like chess, where LLMs fail to adhere to established rules. The authors argue that the current neural network architectures lack the capability for reliable extrapolation and formal reasoning, suggesting that a combination of neural networks with symbolic reasoning—termed neurosymbolic AI—may be essential for future advancements. Gary Marcus, a prominent figure in AI research, emphasizes the need for alternative strategies to address these shortcomings, as the existing models have not yet demonstrated the ability to reason abstractly or manipulate symbols effectively.
- A study from Apple reveals LLMs lack formal reasoning capabilities.
- LLMs primarily rely on pattern matching, making them fragile to input changes.
- Performance declines significantly as problem complexity increases.
- Neurosymbolic AI may be necessary for improving reasoning in AI models.
- Gary Marcus advocates for alternative research strategies to overcome current limitations.
Related
Reasoning skills of large language models are often overestimated
Large language models like GPT-4 rely heavily on memorization over reasoning, excelling in common tasks but struggling in novel scenarios. MIT CSAIL research emphasizes the need to enhance adaptability and decision-making processes.
Have we stopped to think about what LLMs model?
Recent discussions critique claims that large language models understand language, emphasizing their limitations in capturing human linguistic complexities. The authors warn against deploying LLMs in critical sectors without proper regulation.
Transcript for Yann LeCun: AGI and the Future of AI – Lex Fridman Podcast
Yann LeCun discusses the limitations of large language models, emphasizing their lack of real-world understanding and sensory data processing, while advocating for open-source AI development and expressing optimism about beneficial AGI.
LLMs still can't reason like humans
Recent discussions reveal that large language models (LLMs) struggle with basic reasoning tasks, scoring significantly lower than humans. A project called "Simple Bench" aims to quantify these shortcomings in LLM performance.
Understanding the Limitations of Mathematical Reasoning in LLMs - https://news.ycombinator.com/item?id=41808683 - Oct 2024 (127 comments)
We do need to pump up the jam when it comes to formal methods tools, though. And academia is still rife with quantum and AI buzzword generators if you wanna get funding. Formal methods doesn't get enough funding from academia. Amazon has put a bunch of money into it (hiring all the good talent :sadface:), and Microsoft is funding both Z3 and Lean4. Industry is ahead of the game, again. This is purely a failure of academic leadership, nothing else.
[1] https://en.wikipedia.org/wiki/Satisfiability_modulo_theories
[2] https://en.wikipedia.org/wiki/Answer_set_programming
[3] Anecdotal, but this was a "bug" in a solution offered by a tool that optimally schedules football matches in Spain.
https://medium.com/@colin.fraser/who-are-we-talking-to-when-...
LLMs broadly are capable of this, but we force them to not do it by forcing the next token to be the final output.
The human equivalent would be to solve a problem and show all your steps, including steps that were wrong but that you undertook anyway. That's why chain-of-thought reasoning works.
The 'fix' is to allow LLMs to pause, generate tokens that are not transliterated into text, and then signal when they want to unpause. Training such a system is left as an exercise for the reader, although there have been attempts.
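A minimal sketch of the decoding side of that idea, assuming hypothetical `<pause>`/`<unpause>` marker tokens and a canned token generator standing in for a real model:
```
# Sketch only: hidden "thinking" tokens that the model generates but the user
# never sees. The <pause>/<unpause> markers and the toy generator below are
# assumptions for illustration, not any particular model's vocabulary.
PAUSE, UNPAUSE, EOS = "<pause>", "<unpause>", "<eos>"

# Stand-in for a real LLM sampling step: replays a fixed demo sequence.
_DEMO = ["The ", "answer ", PAUSE, "44+58+88=190 ", UNPAUSE, "is 190.", EOS]

def generate_next_token(context):
    return _DEMO[len(context) - 1]   # context[0] is the prompt

def generate_visible_answer(prompt, max_tokens=64):
    context, visible, hidden = [prompt], [], False
    for _ in range(max_tokens):
        tok = generate_next_token(context)
        context.append(tok)          # the model always conditions on its own tokens
        if tok == PAUSE:
            hidden = True            # start of private scratchpad tokens
        elif tok == UNPAUSE:
            hidden = False           # resume user-visible text
        elif tok == EOS:
            break
        elif not hidden:
            visible.append(tok)      # only non-hidden tokens reach the user
    return "".join(visible)

print(generate_visible_answer("How many kiwis?"))  # -> "The answer is 190."
```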
Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?
I pasted it into ChatGPT and Claude, and all four models I tried gave the correct answer:
4o mini: https://chatgpt.com/share/6709814f-9ff8-800e-8aab-127b6f952d...
4o: https://chatgpt.com/share/6709816c-3768-800e-9eb1-173dfbb5d8...
o1-mini: https://chatgpt.com/share/67098178-4088-800e-ba95-9731a75055...
3.5 sonnet: https://gist.github.com/rahimnathwani/34f93de07eb7510d57ec1e...
Those who know about LLMs are aware that they do not reason, but also know it's not very useful to repeat that over and over again, and focus on other aspects of research.
Those who don't know about LLMs simply learn to use them in a way that's useful in their life.
```
To determine the total number of kiwis Oliver has, we’ll sum up the kiwis he picked on each day:
1. Friday: Oliver picks 44 kiwis.
2. Saturday: He picks 58 kiwis.
3. Sunday: He picks double the number he did on Friday, so 2 × 44 = 88 kiwis.
Adding them up:
44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis
The mention of five smaller-than-average kiwis on Sunday doesn’t affect the total count unless specified otherwise.
Answer: 190
```
44+58+88=190
So, Oliver has a total of 190 kiwis. The five smaller kiwis on Sunday are still included in the total count, so they don't change the final sum.
LLMs are not magic bullets for every problem, but that doesn't preclude them from being used to build reliable systems or "agents."
It's clear that we don't yet have the all-encompassing AGI architecture, especially with the transformer model alone, but adding steps beyond the transformer leads to interesting results, as we've seen with current coding tools and the new o1-series models by OpenAI.
For example, the featured article calls out `o1-mini` as failing a kiwi-counting test prompt; however, the `o1-preview` model gets the right answer[0].
I also built a simple test using gpt-4o that prompts it to solve the problem in parts, and it reliably returns the correct answer using only gpt-4o and code generated by gpt-4o[1].
Furthermore, there's still a ton of research being done on models specific to formal theorem proving that show promise[2] (even if `o1-preview` already beats them for e.g. IMO problems[3]).
I'm of the opinion that we still have a ways to go until AGI, but that doesn't mean LLMs can't be used in reliable ways.
[0]https://chatgpt.com/share/e/67098356-ce88-8001-a2e1-9857064a...
[1]https://magicloops.dev/loop/30fb3c1a-8e40-47ae-8611-91554faf...
An LLM isn't a calculator. But we probably can teach it how to use one.
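A rough sketch of what "teach it to use a calculator" could look like: ask the model for a bare arithmetic expression and do the arithmetic deterministically in code. `ask_llm_for_expression()` is a hypothetical stand-in for a real API call; the evaluator only accepts numbers and the four basic operators.
```
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("disallowed expression element")
    return walk(ast.parse(expr, mode="eval"))

def ask_llm_for_expression(question: str) -> str:
    # Hypothetical: a real version would prompt an LLM with
    # "Answer with a single arithmetic expression and nothing else."
    return "44 + 58 + 2 * 44"

print(safe_eval(ask_llm_for_expression("How many kiwis does Oliver have?")))  # -> 190
```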
Then goal posts were moved to logical reasoning such as the Winograd Schemas. Then that wasn’t enough.
In fact, it’s abundantly clear we won’t be satisfied until we’ve completely destroyed human intelligence as superior.
The current goal post is LLMs must do everything better than humans or it’s not AGI. If there is one thing it does worse, people will cite it as just a stochastic parrot. That’s a complete fallacy.
Of course we dare not compare LLMs to the worst-case human - because LLMs would be AGI compared to that.
We compare LLMs to the best human in every category - unfairly.
With LLMs it’s been abundantly clear - there is not a line where something is intelligent or not. There’s only shades of gray and eventually we call it black.
There will always be differences between LLM capabilities and humans - different architectures and different training. However, it's very clear that a process that takes in huge amounts of data and processes it, whether a brain or an LLM, comes up with similar results.
Someone should come up with a definition of intelligence that excludes all LLMs and includes all humans.
Also, while you are at it, disprove that humans do more than what ChatGPT does - aka probabilistic word generation.
I’ll wait.
Until then, as ChatGPT blows past what was science fiction 5 years ago, maybe these arguments aren’t great?
Also - name one task for which we have the data but haven't been able to produce a neural network capable of performing it?
Human bodies have so many sensors it's mind-blowing. The data any human processes in one day simply blows LLMs out of the water.
Touch, taste, smell, hearing, etc…
That’s not to say if you could hook up a hypothetical neural network to a human body, that we couldn’t do the same.
Friday: Oliver picks 44 kiwis. Saturday: Oliver picks 58 kiwis. Sunday: He picks double the amount he picked on Friday: 44 × 2 = 88 kiwis. Now, we sum all the kiwis: 44 + 58 + 88 = 190. Since the size of five kiwis on Sunday doesn't affect the total count, Oliver still has: 190 kiwis.
The idea that "LLMs have difficulty ignoring extraneous and irrelevant information" is not really dispositive to their effectiveness, since this statement obviously applies to humans as well.
I buy 102 mandarins on Monday and then on Tuesday I buy another 48. On Wednesday I buy 98 apples. I didn't like the last 3 mandarins I bought. How many mandarins do I have?
You bought 102 mandarins on Monday and 48 more on Tuesday, which gives a total of:
102 + 48 = 150 mandarins.
Since you didn't like the last 3 mandarins, you subtract them:
150 - 3 = 147 mandarins.
So, you have 147 mandarins.
This would only be an argument against LLMs reasoning if you concede, from the above, that humans also don't do formal reasoning.
When Google came out, search engines were suddenly more useful. But there were a bunch of people talking about how “Not everything they find is right” and how “that is a huge problem”.
Then for two decades, people used search highly successfully. Fascinating thing. Tool use.
The example in the article: https://chatgpt.com/share/6709a02d-b7cc-800c-882b-430bf019a0...
This paper presents a novel framework for multi-stream tokenization, which extends traditional NLP tokenization by generating simultaneous, multi-layered token representations that integrate subword embeddings, logical forms, referent tracking, scope management, and world distinctions. Unlike conventional language models that tokenize based solely on surface linguistic features (e.g., subword units) and infer relationships through deep contextual embeddings, our system outputs a rich, structured token stream. These streams include logical expressions (e.g., `∃x (John(x) ∧ Loves(x, Mary))`), referent identifiers (`ref_1`, `ref_2`), and world scopes (`world_1`, `world_2`) in parallel, enabling precise handling of referential continuity, modal logic, temporal reasoning, and ambiguity resolution across multiple passages and genres, including mathematical texts, legal documents, and natural language narratives.
This approach leverages symbolic logic and neural embeddings in a hybrid architecture, enhancing the model’s capacity for reasoning and referential disambiguation in contexts where linguistic and logical complexity intertwine. For instance, tokens for modal logic are generated concurrently with referential tokens, allowing expressions such as "If John had gone to the store, Mary would have stayed home" to be dynamically represented across possible worlds (`world_1`, `world_2`) with embedded logical dependencies (`If(Go(John, Store), Stay(Mary, Home))`).
We explore how each token stream (e.g., subword, referent, logical, scope, world) interacts in real time within a transformer-based architecture, employing distinct embedding spaces for each type. The referent space (`ref_n`) facilitates consistent entity tracking, even across ambiguous or coreferential contexts, while scope spaces (`scope_n`) manage logical boundaries such as conditional or nested clauses. Additionally, ambiguity tokens (`AMBIGUOUS(A,B)`) are introduced to capture multiple possible meanings, ensuring that referents like "bank" (financial institution or riverbank) can be resolved as more context is processed.
By extending the capabilities of existing neuro-symbolic models (e.g., Neural Theorem Provers and Hybrid NLP Systems) and integrating them with modern transformer architectures (Vaswani et al., 2017), this system addresses key limitations in current models, particularly in their handling of complex logical structures and referent disambiguation. This work sets the foundation for a new class of multi-dimensional language models that are capable of performing logical reasoning and context-sensitive disambiguation across diverse textual domains, opening new avenues for NLP applications in fields like law, mathematics, and advanced AI reasoning systems.
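For concreteness, here is one way the multi-stream token described above might be represented as a data structure; every field name here is an assumption made for this sketch, not part of any published system:
```
# Illustration only: a record carrying the parallel streams the abstract
# describes (subword, referent, logical form, scope, world, ambiguity).
from dataclasses import dataclass, field

@dataclass
class MultiStreamToken:
    subword: str                         # surface token, e.g. "John"
    referent: str | None = None          # e.g. "ref_1" for entity tracking
    logical: str | None = None           # e.g. "Loves(ref_1, ref_2)"
    scope: str | None = None             # e.g. "scope_1" for a conditional clause
    world: str | None = None             # e.g. "world_1" vs "world_2" for modals
    ambiguity: list[str] = field(default_factory=list)  # e.g. ["bank/finance", "bank/river"]

sentence = [
    MultiStreamToken("John", referent="ref_1", logical="John(ref_1)", world="world_1"),
    MultiStreamToken("loves", logical="Loves(ref_1, ref_2)", world="world_1"),
    MultiStreamToken("Mary", referent="ref_2", logical="Mary(ref_2)", world="world_1"),
]
print(sentence[0])
```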
Look at the algorithmic tools used in ML and automated theorem proving, for example: ML uses gradient descent (and related numerical methods) for local optimization, while constraint satisfaction/optimization, Boolean satisfiability (SAT), satisfiability modulo theories (SMT), quantified Boolean formulas (QBF), etc., rely on combinatorial search. Mathematically, combinatorial optimization is far more problematic and much more difficult than numerical methods, partly because modern computers and NVidia gaming cards are really fast at crunching floating point numbers (which doesn't help the combinatorial side), and partly because most problems in combinatorial optimization are NP-hard or harder.
Now think of what an LLM and local optimization are doing: essentially searching and combining sequences of words from Wikipedia and books. But search is not necessarily a difficult problem; a lookup can even be O(1). Multiplying matrices, by contrast, is an O(n^2.8) problem (or whatever exponent they have gotten it down to), and factorization is God knows what complexity class once you take quantum computing into the game.
Great, these are my 2 cents for the day, good luck to the OpenAI investors (I am also investing there a bit as a Bay Area citizen). You guys will certainly make help desk support cheaper...
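For readers who haven't used the SMT tools mentioned above (e.g. Z3), a minimal example of the combinatorial side, assuming the `z3-solver` Python package is installed:
```
# Tiny SMT example: Z3 searches for integers satisfying all constraints and can
# prove unsatisfiability; this is combinatorial search, not gradient descent.
from z3 import Int, Solver, sat

x, y = Int("x"), Int("y")
s = Solver()
s.add(x + y == 10, x > y, y >= 0)

if s.check() == sat:
    print(s.model())        # prints one satisfying assignment, e.g. x = 10, y = 0
else:
    print("unsatisfiable")
```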
LLMs are far from perfect but they can be a very useful tool that, used well, can add significant value in spite of their flaws. Large numbers of people and businesses are extracting huge value from the use of LLMs every single day. Some people are building what will become wildly successful businesses around LLM technology.
Yet in the face of this we still see a population of naysayers who appear intent on rubbishing LLMs at any cost. To me that seems like a pretty bad faith dialogue.
I’m aware that a lot of the positive rhetoric, particularly early on after the first public release of ChatGPT was overstated - sometimes heavily so - but taking one set of shitty arguments and rhetoric and responding to it with polar opposite, but equally shitty, arguments and rhetoric for the most part only serves to double the quantity of shitty arguments and rhetoric (and, adding insult to injury, often does so in the name of “balance”).
Every time I see this guy pop up, it's some bad take or an argument with someone. What's the deal with him?
LLMs aren't totally out of scope of mathematical reasoning. LLMs roughly do two things: move data around, and recognize patterns. Reasoning leans heavily on moving data around according to context-sensitive rules, which is well within the scope of LLMs. The problem is that general problem solving requires potentially arbitrary amounts of data movement, but current LLM architectures have a fixed number of translation/rewrite steps they can perform before they must produce output. This means most complex reasoning problems are out of bounds for LLMs, so they learn to lean heavily on pattern matching. But this isn't an intrinsic limitation of LLMs as a class of computing device, just a limit of current architectures.
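A back-of-envelope way to state that fixed-budget point (the notation is assumed here, not taken from the comment): for a decoder with L layers that emits T tokens,
```
% each forward pass applies at most L sequential transformation steps,
% so across T generated tokens the total sequential "rewrite" budget is bounded:
\[
  \text{sequential rewrite steps} \;\le\; L \cdot T .
\]
```
Anything whose shortest known solution needs more sequential steps than that has to be handled by emitting more intermediate tokens (longer chains of thought) or by falling back on pattern matching.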
All these criticisms are valid for human beings too. That kind of question trickery trips up school kids all the time. It's hard to use our brains to reason. It takes practice, and the representation of the "reasoning" always ends up being alien to our actual cognitive experience. We literally have invented whole paradigms for how to write this stuff down so that it can be communicated to our peers.
So yeah, LLMs aren't ever going to be "better" than humans at reasoning, necessarily, simply because we both suck at it. But they'll improve, likely via a bunch of analogs to human education. "Here's how to teach an LLM about writing a formal proof" just hasn't been figured out yet.
My point being, LLMs are capable of reasoning, and formal reasoning is meaningless in this context.