Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs
The paper examines reasoning abilities of Large Language Models, distinguishing inductive from deductive reasoning. It introduces SolverLearner, showing LLMs excel in inductive reasoning but struggle with deductive tasks, particularly counterfactuals.
The paper titled "Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs" by Kewei Cheng and colleagues explores the reasoning capabilities of Large Language Models (LLMs), specifically distinguishing between inductive and deductive reasoning. The authors argue that previous research has not adequately differentiated these two types of reasoning, leading to a conflation of their respective challenges. They introduce a new framework called SolverLearner, which allows LLMs to learn functions that map input data to output values from in-context examples, thereby isolating inductive reasoning from deductive reasoning. Their findings indicate that LLMs exhibit strong inductive reasoning capabilities, achieving near-perfect performance in many cases. However, the study also reveals that LLMs struggle with deductive reasoning, particularly in tasks that require counterfactual reasoning. This research highlights the need for a clearer understanding of the reasoning abilities of LLMs and suggests that while they excel in inductive reasoning, their deductive reasoning skills are comparatively limited.
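As a rough illustration of that separation (my own sketch, not the authors' code): the model is shown input-output examples and asked to propose a rule as a Python function (the inductive step), and an ordinary interpreter, rather than the model, then applies that function to test inputs (the deductive step). The `ask_llm` helper and the prompt wording below are hypothetical stand-ins.

```python
# Minimal sketch of the SolverLearner idea. Assumptions: `ask_llm` is a
# hypothetical stand-in for an LLM completion call, and the prompt wording
# is invented for illustration.

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; returns the model's text completion."""
    raise NotImplementedError("plug in a real completion API here")

def learn_function(examples: list[tuple[str, str]]) -> str:
    """Inductive step: ask the model to propose a Python function f
    consistent with the in-context examples."""
    shown = "\n".join(f"f({inp!r}) == {out!r}" for inp, out in examples)
    prompt = (
        "Write a Python function f(x) consistent with all of these cases:\n"
        f"{shown}\nReturn only the code."
    )
    return ask_llm(prompt)

def apply_function(code: str, test_inputs: list[str]) -> list[str]:
    """Deductive step: a plain Python interpreter, not the model,
    applies the proposed rule to unseen inputs."""
    namespace: dict = {}
    exec(code, namespace)  # sketch only; run untrusted code in a sandbox
    return [namespace["f"](x) for x in test_inputs]
```

The point of the split is that applying the learned rule is delegated to code, so any remaining errors can be attributed to the induction step rather than to the model's execution of the rule.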
- The paper distinguishes between inductive and deductive reasoning in LLMs.
- A new framework, SolverLearner, is proposed to isolate and evaluate inductive reasoning.
- LLMs demonstrate strong inductive reasoning capabilities but struggle with deductive reasoning.
- The research emphasizes the importance of understanding the reasoning abilities of LLMs.
- Findings suggest a need for further exploration of LLMs' deductive reasoning, especially in counterfactual scenarios.
- Many commenters express skepticism about the ability of LLMs to genuinely reason, arguing that they primarily rely on memorization and pattern matching rather than true logical inference.
- There is a notable absence of consideration for abductive reasoning in the analysis, with some suggesting it is a significant oversight.
- Critics highlight the limitations of the experiments conducted, questioning the validity of the results due to potential biases in the training data.
- Some commenters propose that LLMs may exhibit a hybrid form of reasoning that combines statistical calculations with rudimentary reasoning processes.
- Overall, there is a consensus that more rigorous definitions and methodologies are needed to assess LLM reasoning capabilities accurately.
You cannot test reasoning when you don't know what's in the training set. You have to be able to differentiate reasoning from memorization, and that's not trivial.
What's more, the results look like they confirm that at least some memorization is going on. Do we really not think GPT has been extensively trained on arithmetic in base 10, 8, and 16? That seems like a terrible prior. Even if not explicitly, how much code has it read that performs these operations? How many web pages, tutorials, and Reddit posts cover octal and hex? They also haven't defined zero-shot correctly: arithmetic in these bases isn't zero-shot, it's explicitly in distribution...
I'm unsure about base 9 and 11. It's pretty interesting to see that GPT-4 is much better at these. Anyone know why? Did they train on these? More bases? Doesn't seem unreasonable, but I don't know.
The experimentation is also extremely lacking. The arithmetic questions comprise only 1,000 tests of two-digit addition, which is certainly in the training data. I'm also unconvinced by the syntax-reasoning tasks, since the transformer (attention) architecture seems designed for exactly this, and I'm unconvinced those tasks aren't in the training data either. Caesar ciphers are also certainly in the training data.
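For a sense of what those tests involve, a two-digit addition problem in an arbitrary base can be generated and checked mechanically; the snippet below is my own illustration, not the paper's harness.

```python
import random
import string

DIGITS = string.digits + string.ascii_lowercase  # 0-9, then a-z for bases > 10

def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base (2-36)."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, base)
        out.append(DIGITS[r])
    return "".join(reversed(out))

def make_problem(base: int) -> tuple[str, str, str]:
    """Two two-digit operands in `base`, plus the ground-truth sum."""
    lo, hi = base, base * base - 1  # the two-digit range in that base
    a, b = random.randint(lo, hi), random.randint(lo, hi)
    return to_base(a, base), to_base(b, base), to_base(a + b, base)

# e.g. in base 11: '7a' + '95' = '164'  (87 + 104 = 191 in decimal)
a, b, answer = make_problem(11)
print(f"In base 11, {a} + {b} = {answer}")
```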
The prompts are also odd, and I guess that's why they're in the appendix. For example, getting GPT to do better at math and many other tasks by having it write Python code is not novel.
There's some stuff here, but this really doesn't seem like a lot of work for 12 people from a top university and a trillion-dollar company. It's odd to see that many authors when the experiments can be run in a fairly short time.
Abductive reasoning is common in day-to-day life. It seeks the best explanation for some (often incomplete) observations, and reaches conclusions without certainty. I would have thought it would be important to assess for LLMs.
They are statistical text generators, whose results are defined by their training data set. This is why the paper cited reads thusly:
Despite extensive research into the reasoning capabilities of Large Language Models (LLMs), most studies have failed to rigorously differentiate between inductive and deductive reasoning ...
There is no differentiation because what was sought is the existence of something that does not exist. The authors then postulate:
This raises an essential question: In LLM reasoning, which poses a greater challenge - deductive or inductive reasoning?
There is no such thing as "LLM reasoning." Therefore, the greatest challenge is accepting this fact and/or that anthropomorphism is a real thing.

But any mention of LLM reasoning ability ought to address the obvious confound: the LLM is trained on examples of deductive reasoning, inductive reasoning, abductive reasoning, SAT-solver reasoning, geniuses' musings, etc. If it replicates one of those examples, should that be called "reasoning" of any sort or not? Regurgitating those examples may even involve some generalization, if the original topics of an example are swapped out (perhaps for a nearby topic in latent space).
Given that it appears they're training and testing on synthetic problems, this objection probably does not apply to their actual results. But given the fuzziness it creates for the definition of "reasoning" of any sort, I would have expected some working definition of reasoning in the paper's abstract.
Training on Moby Dick and thus being able to regurgitate text from Moby Dick does not mean the LLM is capable of writing a new Moby Dick-like book. (Thankfully; one is more than enough!)
I love seeing Victor Taelin experimenting with parallelizing these programs (with HVM and other experiments with proof languages), but it's sometimes a bit sad how much time researchers spend making papers about existing things instead of trying to improve the state of the art in something that's most probably missing from the current models.
I don't know about "typical", but every source that classifies reasoning (or, more appropriately, logical inference) as deductive and inductive also includes the abductive category. This categorisation scheme goes all the way back to Charles Sanders Peirce:
'[Abduction] is logical inference (...) having a perfectly definite logical form. (...) Namely, the hypothesis cannot be admitted, even as a hypothesis, unless it be supposed that it would account for the facts or some of them. The form of inference, therefore, is this:
The surprising fact, C, is observed;
But if A were true, C would be a matter of course,
Hence, there is reason to suspect that A is true.' (Collected Papers of Charles Sanders Peirce. Peirce, 1958)
(Quote copied from Abduction and Induction: Essays on their Relation and Integration, Peter Flach and Antonis Kakas, eds., 2000)
Consider a logical theory, formed of rules in the form of implications like A -> B (premise A implies conclusion B). Abduction is the inference of the premises after observation of the conclusions, i.e. if A -> B AND B is observed, then A may be inferred.
That's a different inference mode from both deduction: inferring a conclusion from a premise, e.g. if A -> B AND A, then B may be inferred; and induction: inferring a rule from an observation, e.g. inferring A -> B after observing A and B. Note that this is a simplification: induction assumes a background theory of more rules A1 -> A2, ..., An -> A that can be applied to the observation A and B to infer A -> B.
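A toy encoding of that contrast (my own sketch of the distinction, with a single made-up rule rain -> wet_grass):

```python
# Toy illustration of the three inference modes over rules "premise -> conclusion".
# This is my own sketch of the distinction described above, not a real reasoner.

rules = {("rain", "wet_grass")}  # rain -> wet_grass

def deduce(facts: set[str]) -> set[str]:
    """Deduction: from A and A -> B, conclude B."""
    return facts | {concl for prem, concl in rules if prem in facts}

def abduce(observation: str) -> set[str]:
    """Abduction: from B and A -> B, hypothesise A as an explanation."""
    return {prem for prem, concl in rules if concl == observation}

def induce(observations: list[tuple[str, str]]) -> set[tuple[str, str]]:
    """Induction: from observing A together with B, propose the rule A -> B."""
    return set(observations)

print(deduce({"rain"}))                 # {'rain', 'wet_grass'}
print(abduce("wet_grass"))              # {'rain'}  (plausible, not certain)
print(induce([("rain", "wet_grass")]))  # {('rain', 'wet_grass')}
```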
Anyway, abduction is generally associated with probabilistic reasoning, albeit informally so. That probably means that we should categorise LLM inference as abductive, since it guesses the next token according to a model of probabilities of token sequences. But that's just a, er, guess.
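The "model of probabilities of token sequences" part can be made literal with toy numbers (illustration only, nothing like a real model):

```python
import math
import random

# Toy next-token step: the model scores candidate tokens (logits),
# softmax turns the scores into probabilities, and one token is sampled.
logits = {"the": 2.1, "a": 1.3, "reasoning": 0.2}
z = sum(math.exp(v) for v in logits.values())
probs = {tok: math.exp(v) / z for tok, v in logits.items()}
next_token = random.choices(list(probs), weights=list(probs.values()))[0]
print(probs, "->", next_token)
```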
But they clearly struggle with generalization and rule-following. This failure to generalize (extrapolate, deduce, compute) is why we still can't fire all of our DBAs.
Has anyone encountered an LLM-based text-to-SQL engine that actually gets the job done? I think that's your best canary. I stopped caring somewhere around "transpose these 2 letters of the alphabet" not working consistently.