August 16th, 2024

Does Reasoning Emerge? Probabilities of Causation in Large Language Models

The paper investigates the reasoning capabilities of large language models, focusing on the probability of necessity and the probability of sufficiency, and proposes a framework for evaluating their reasoning through mathematical examples.

Read original article
Sentiment: skepticism, curiosity, frustration

The paper titled "Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models" by Javier González and Aditya V. Nori explores the reasoning capabilities of large language models (LLMs) in relation to human-like thinking. It addresses the ongoing debate regarding the extent to which LLMs can perform actual reasoning, focusing on two probabilistic concepts: the probability of necessity (PN) and the probability of sufficiency (PS). The authors propose a theoretical and practical framework to evaluate how effectively LLMs can mimic real-world reasoning mechanisms using these probabilistic measures. By conceptualizing LLMs as abstract machines that process information through natural language, the study investigates the conditions necessary for computing approximations of PN and PS. The research aims to enhance understanding of when LLMs exhibit reasoning capabilities, supported by a series of mathematical examples.

- The paper examines reasoning capabilities of large language models (LLMs).

- It focuses on the concepts of probability of necessity (PN) and probability of sufficiency (PS), defined below.

- A theoretical and practical framework is introduced for evaluating LLM reasoning.

- The study investigates conditions for approximating PN and PS in LLMs.

- Mathematical examples are used to illustrate LLM reasoning capabilities.
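For readers unfamiliar with the terms, PN and PS have standard counterfactual definitions in the causal-inference literature (due to Pearl). For a binary cause X and effect Y:

    PN = P(Y_{X=0} = 0 | X = 1, Y = 1)
    PS = P(Y_{X=1} = 1 | X = 0, Y = 0)

Informally, PN asks: given that X and Y both occurred, would Y have failed to occur had X not occurred? PS asks: given that neither occurred, would Y have occurred had X been made to occur?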

AI: What people are saying
The comments on the article about large language models (LLMs) and their reasoning capabilities reveal several key themes and points of contention.
  • Many commenters express skepticism about LLMs' ability to abstract and reason like humans, suggesting they primarily pattern-match based on training data.
  • There is a concern that proposed benchmarks for measuring reasoning may not be sustainable, as future models could be fine-tuned to perform well on these specific tests without demonstrating true understanding.
  • Some argue that LLMs lack cognitive faculties that go beyond semantic processing, emphasizing that human cognition involves more than just language manipulation.
  • Comments highlight the limitations of current LLM architectures in handling complex reasoning tasks, with some suggesting that more advanced algorithms may be necessary.
  • There is a general consensus that simply increasing training data will not lead to artificial general intelligence (AGI), as deeper cognitive processes are required.
12 comments
By @layer8 - 6 months
My impression is that LLMs “pattern-match” on a less abstract level than general-purpose reasoning requires. They capture a large number of typical reasoning patterns through their training, but it is not sufficiently decoupled, or generalized, from what the reasoning is about in each of the concrete instances that occur in the training data. As a result, the apparent reasoning capability that LLMs exhibit significantly depends on what they are asked to reason about, and even depends on representational aspects like the sentence patterns used in the query. LLMs seem to be largely unable to symbolically abstract (as opposed to interpolate) from what is exemplified in the training data.
By @w10-1 - 6 months
Their hypothesis is a good one:

- A form of reasoning is to connect cause and effect via the probability of necessity (PN) and the probability of sufficiency (PS).

- You can identify when natural-language inputs can support PN and PS inference, based on LLM modeling.

That would mean you can engineer in more causal reasoning based on data input and model architecture.

They define causal functions, project accuracy measures (false positives/negatives) onto factual and counterfactual assertion tests, and measure LLM performance with respect to this accuracy. They establish a surprisingly low tolerance for counterfactual error rates, and suggest this might indicate an upper limit for reasoning based on current LLM architectures.

Their findings are limited by how constrained their approach is (short, simple boolean chains). It's hard to see how this approach could be extended to more complex reasoning. Conversely, if/since LLMs can't get this right, it's hard to see them progressing at the rates hoped for, unless this approach somehow misses a dynamic of a larger model.

It seems like this would be a very useful starting point for LLM quality engineering, at least for simple inference.
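(For reference, the factual/counterfactual accuracy setup described above can be made concrete with a toy example. The sketch below is not the paper's protocol, only an illustration of that kind of test on a short boolean chain; query_llm is a hypothetical placeholder for a real yes/no model call.)

    # Toy factual/counterfactual assertion test on a short boolean chain
    # (A -> B -> C), in the spirit of the evaluation described above.
    import random

    def query_llm(prompt: str) -> bool:
        """Hypothetical stand-in for a real yes/no model call.
        Answers at random here so the sketch runs end to end."""
        return random.random() < 0.5

    def chain_truth(a: bool) -> bool:
        """Ground truth in the toy world: A causes B, and B causes C."""
        b = a
        c = b
        return c

    def evaluate(n_trials: int = 200) -> None:
        false_pos = false_neg = cf_errors = 0
        for _ in range(n_trials):
            a = random.choice([True, False])

            # Factual assertion: does the model agree with what follows from A?
            truth = chain_truth(a)
            answer = query_llm(f"A is {a}. A causes B, and B causes C. Is C true?")
            if answer and not truth:
                false_pos += 1
            elif truth and not answer:
                false_neg += 1

            # Counterfactual assertion: flip the cause and ask again.
            flipped = not a
            cf_truth = chain_truth(flipped)
            cf_answer = query_llm(
                f"Originally A was {a}. Had A been {flipped} instead, "
                f"with A causing B and B causing C, would C be true?"
            )
            if cf_answer != cf_truth:
                cf_errors += 1

        print(f"false positives: {false_pos}  false negatives: {false_neg}  "
              f"counterfactual error rate: {cf_errors / n_trials:.2f}")

    if __name__ == "__main__":
        evaluate()

Presumably, low counterfactual error rates on tests of roughly this shape are what would justify treating a model's answers as usable approximations of PN and PS.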

By @heyjamesknight - 6 months
LLMs have access to the space of collective semantic understanding. I don't understand why people expect cognitive faculties that are clearly extra-semantic to just fall out of them eventually.

The reason they sometimes appear to reason is because there's a lot of reasoning in the corpus of human text activity. But that's just a semantic artifact of a non-semantic process.

Human cognition is much more than just our ability to string sentences together.

By @doe_eyes - 6 months
This is proposed as a way to measure "true" reasoning by asking a certain type of trick question, but I don't quite see how this could be the basis of a sustainable benchmark.

If this gets attention, the next generation of LLMs will be trained on this paper, and then fine-tuned by using this exact form of questions to appear strong on this benchmark, and... we're back to square one.

By @IgorPartola - 6 months
So the best way I can describe how humans abstract our thinking is that “a thought about a thought is itself a thought”. I am not an expert but I don’t believe LLMs can arbitrarily abstract their current “thought”, put it down for later contemplation, trace back to a previous thought or a random thought, and out of these individual thoughts form an understanding.

I would expect that a higher level algorithm would be required to string together thoughts into understandings.

Then again, I wonder if what we are going to see is fundamentally different kinds of intelligences that just do not necessarily think like humans. Chimps cannot tell you about last Tuesday, since their memory seems a lot more associative than recall-based. But they have situational awareness that even the superheroes in our comics do not generally possess (flash some numbers in front of a chimp for one second and he will remember all their positions and order, even if you distract him immediately after). Maybe LLMs cannot be human-intelligent, but you could argue that they are a kind of intelligence.

By @layer8 - 6 months
Regarding AI reasoning and abstraction capabilities, the ARC Prize competition is an interesting project: https://arcprize.org/
By @abcde777666 - 6 months
There are many pillars of our own intelligence that we tend to gloss over. For instance - awareness and the ability to direct attention. Or something as simple as lifting your hand and moving some fingers at will. Those things impress me far more than the noises we produce with our mouths!
By @slashdave - 6 months
There seems to be the implicit (and unspoken) assumption that these probability terms (PN,PS) are all independent. However, clearly they are not.
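(For reference, the dependence can be made explicit: in Pearl's treatment, the joint quantity PNS, the probability that the cause is both necessary and sufficient, decomposes as

    PNS = P(X=1, Y=1) * PN + P(X=0, Y=0) * PS

so PN and PS are tied together through PNS and the observational distribution rather than being free parameters.)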
By @rockskon - 6 months
Simple answer to the question posed by the headline:

No.

As much as Google, Microsoft, OpenAI, and every other company that's poured billions into this technology want to think otherwise - more training data will not turn your AI model into AGI.

Any argument to the contrary is copium.

By @sweeter - 6 months
probability? None imo.
By @kantapproves - 6 months
I believe Hume (and Kant) have some things to say about this.

The connection might need some fleshing out, but I believe, and I might be wrong here, it was decided a few centuries ago that probabilities alone cannot explain causality. It would be a hoot, wouldn’t it?

Perhaps AI just needs some a priori synthetics to spruce it up.