Does Reasoning Emerge? Probabilities of Causation in Large Language Models
The paper investigates the reasoning capabilities of large language models, focusing on the probability of necessity and the probability of sufficiency, and proposes a framework for evaluating their reasoning through mathematical examples.
The paper titled "Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models" by Javier González and Aditya V. Nori explores the reasoning capabilities of large language models (LLMs) in relation to human-like thinking. It addresses the ongoing debate over the extent to which LLMs can perform actual reasoning, focusing on two probabilistic concepts: the probability of necessity (PN) and the probability of sufficiency (PS). The authors propose a theoretical and practical framework to evaluate how effectively LLMs can mimic real-world reasoning mechanisms using these probabilistic measures. By conceptualizing LLMs as abstract machines that process information through natural language, the study investigates the conditions necessary for computing approximations of PN and PS. The research aims to improve understanding of when LLMs exhibit reasoning capabilities, supported by a series of mathematical examples.
- The paper examines reasoning capabilities of large language models (LLMs).
- It focuses on the concepts of probability of necessity (PN) and probability of sufficiency (PS); the standard definitions are sketched after this list.
- A theoretical and practical framework is introduced for evaluating LLM reasoning.
- The study investigates conditions for approximating PN and PS in LLMs.
- Mathematical examples are used to illustrate LLM reasoning capabilities.
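For reference, and not necessarily in the paper's exact notation, these are the standard counterfactual definitions of PN and PS from the causal-inference literature, for a binary cause X and effect Y:

PN = P(Y_{X=0} = 0 | X = 1, Y = 1): among cases where both the cause and the effect occurred, the probability that the effect would have been absent had the cause been absent.
PS = P(Y_{X=1} = 1 | X = 0, Y = 0): among cases where neither occurred, the probability that the effect would have occurred had the cause been present.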
Related
Reasoning in Large Language Models: A Geometric Perspective
The paper delves into how large language models reason geometrically, linking self-attention graph density to expressive power. Higher intrinsic dimensions enhance LLMs' capacity, supported by theory, toy examples, and empirical evidence.
Can Large Language Models Understand Symbolic Graphics Programs?
The study evaluates large language models' understanding of symbolic graphics programs, introducing a benchmark and Symbolic Instruction Tuning to enhance reasoning and instruction-following capabilities in visual content comprehension.
- Many commenters express skepticism about LLMs' ability to abstract and reason like humans, suggesting they primarily pattern-match based on training data.
- There is a concern that proposed benchmarks for measuring reasoning may not be sustainable, as future models could be fine-tuned to perform well on these specific tests without demonstrating true understanding.
- Some argue that LLMs lack cognitive faculties that go beyond semantic processing, emphasizing that human cognition involves more than just language manipulation.
- Comments highlight the limitations of current LLM architectures in handling complex reasoning tasks, with some suggesting that more advanced algorithms may be necessary.
- There is a general consensus that simply increasing training data will not lead to artificial general intelligence (AGI), as deeper cognitive processes are required.
- A form of reasoning is to connect cause and effect via the probability of necessity (PN) and the probability of sufficiency (PS).
- You can identify when natural-language inputs can support PN and PS inference, based on how the LLM is modeled.
That would mean you could engineer in more causal reasoning through data inputs and model architecture.
They define causal functions, project accuracy measures (false positives/negatives) onto factual and counterfactual assertion tests, and measure LLM performance with respect to this accuracy. They establish a surprisingly low tolerance for counterfactual error rates, and suggest it might indicate an upper limit on reasoning with current LLM architectures.
Their findings are limited by how constrained their approach is (short, simple boolean chains). It's hard to see how this approach could be extended to more complex reasoning. Conversely, if/since LLMs can't get this right, it's hard to see them progressing at the rates hoped for, unless this approach somehow misses a dynamic of larger models.
It seems like this would be a very useful starting point for LLM quality engineering, at least for simple inference.
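To make that concrete, here is a minimal sketch, not the paper's code, of how factual and counterfactual pass rates on simple boolean chains could be turned into PN/PS estimates. It assumes exogeneity and monotonicity, under which Pearl's point identities for PN and PS hold; the pass rates and variable names below are hypothetical.

```python
# Minimal sketch (not from the paper): estimating PN and PS from factual and
# counterfactual pass rates, assuming exogeneity and monotonicity so that
# Pearl's point identities apply. All numbers are hypothetical.

def pn_ps(p_y_given_x: float, p_y_given_not_x: float) -> tuple[float, float]:
    """Point estimates of PN and PS under exogeneity + monotonicity."""
    risk_difference = p_y_given_x - p_y_given_not_x
    pn = risk_difference / p_y_given_x              # P(Y_{X=0}=0 | X=1, Y=1)
    ps = risk_difference / (1.0 - p_y_given_not_x)  # P(Y_{X=1}=1 | X=0, Y=0)
    return pn, ps

# Hypothetical pass rates over a batch of short boolean chains:
# how often the model asserts the conclusion when the antecedent is set true,
# and how often it still asserts it when the antecedent is set false.
factual_rate = 0.92         # estimate of P(Y=1 | do(X=1)) from factual tests
counterfactual_rate = 0.15  # estimate of P(Y=1 | do(X=0)) from counterfactual tests

pn, ps = pn_ps(factual_rate, counterfactual_rate)
print(f"PN ~= {pn:.2f}, PS ~= {ps:.2f}")
```

Even a small counterfactual error rate shifts these estimates noticeably, which is consistent with the low tolerance for counterfactual errors the commenter describes.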
The reason they sometimes appear to reason is that there's a lot of reasoning in the corpus of human text activity. But that's just a semantic artifact of a non-semantic process.
Human cognition is much more than just our ability to string sentences together.
If this gets attention, the next generation of LLMs will be trained on this paper, and then fine-tuned by using this exact form of questions to appear strong on this benchmark, and... we're back to square one.
I would expect that a higher-level algorithm would be required to string together thoughts into understandings.
Then again, I wonder if what we are going to see is fundamentally different kinds of intelligences that just do not necessarily think like humans. Chimps cannot tell you about last Tuesday, since their memory seems a lot more associative than recall-based. But they have situational awareness that even the superheroes in our comics do not generally possess (flash some numbers in front of a chimp for one second and he will remember all their positions and order, even if you distract him immediately after). Maybe LLMs cannot be intelligent in a human way, but you could argue that they are a kind of intelligence.
No.
As much as Google, Microsoft, OpenAI, and every other company that's poured billions into this technology want to think otherwise, more training data will not turn your AI model into AGI.
Any argument to the contrary is copium.
The connection might need some fleshing out, but I believe, and I might be wrong here, that it was decided a few centuries ago that probabilities alone cannot explain causality. It would be a hoot, wouldn't it?
Perhaps AI just needs some a priori synthetics to spruce it up.