November 2nd, 2024

Apple researchers ran an AI test that exposed a fundamental 'intelligence' flaw

Apple researchers found that many AI models struggle with basic arithmetic when irrelevant data is included, highlighting a lack of genuine logical reasoning and cautioning against overestimating AI's intelligence.

Read original article

Apple researchers ran an AI test that exposed a fundamental 'intelligence' flaw

Apple researchers have highlighted a significant limitation in artificial intelligence (AI) through a recent experiment. In a study, they tested various AI models on a simple arithmetic problem involving the calculation of kiwis picked over three days. While a typical schoolchild could solve the problem correctly, over 20 advanced AI models failed to do so, particularly when the problems included irrelevant data. This performance drop raises questions about the current capabilities of AI, suggesting that these models lack genuine logical reasoning. The researchers concluded that simply scaling data or models will not fundamentally resolve these issues. Their findings serve as a cautionary note against overestimating AI's intelligence, emphasizing that while AI can perform certain tasks impressively, its understanding and reasoning abilities are not as advanced as they may seem. The paper aims to temper enthusiasm for AI by underscoring the distinction between simulated intelligence and true logical reasoning.

- Apple researchers found that many AI models struggle with basic arithmetic when irrelevant data is included.

- The study indicates that current AI models lack genuine logical reasoning capabilities.

- Scaling data or models is unlikely to solve the fundamental issues identified in AI performance.

- The findings serve as a caution against overestimating AI's intelligence and capabilities.

Everyone Is Judging AI by These Tests. Experts Say They're Close to Meaningless

Benchmarks used to assess AI models may mislead, lacking crucial insights. Google and Meta's AI boasts are criticized for outdated, unreliable tests. Experts urge more rigorous evaluation methods amid concerns about AI's implications.

The Sobering Reality of AI: A Researcher's Perspective

Terrance Craddock critiques large language models, noting a 10% success rate in accurate responses. He highlights their unreliability through a simple test, raising concerns about AI's practical applications and credibility.

Apple study proves LLM-based AI models are flawed because they cannot reason

Apple's study reveals significant reasoning shortcomings in large language models from Meta and OpenAI, introducing the GSM-Symbolic benchmark and highlighting issues with accuracy due to minor query changes and irrelevant context.

Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities

Apple's study reveals significant flaws in large language models' logical reasoning, showing they rely on pattern matching. Minor input changes lead to inconsistent answers, suggesting a need for neurosymbolic AI integration.

LLMs can't perform "genuine logical reasoning," Apple researchers suggest

A study by Apple engineers shows large language models struggle with logical reasoning, exhibiting significant accuracy drops with minor changes to problems, indicating reliance on pattern matching over true understanding.

4 comments

By @keernan - 6 months

LLMs have been hyped as AI. The hype train transformed NVidea from a relatively minor niche tech company into the largest company in the world in the past 24 months. The US government called emergency meetings to address the urgent threat posed by unrestrained new AI releases. Sam Altman was fired - and rehired - over concerns regarding threats to humanity.

The tech world has been consumed and transformed by "AI" over the psst 24 months. Microsft and other big tech companies have committed to invest billions upon billions in the effort to win the "AI race".

And its all bogus hype.

First, LLMs are darn good as language models. They are very helpful in language based fields, such as programming, but even then they need human oversight. Because they are NOT AI. They cannot comprehend. They can engage in what certainly appears to be a human like conversation. But the LLM has no "understanding" of the words it prints in the screen. And, based upon its design, it never will. It is a fascinating parlor trick, useful in its own way, but never deserving of the hype train of the last 24 months.

Secondly, the savings it will generate via automation are likely to be minuscule in comparison to the costs being expended by Microsoft and other tech companies. Clearly the top professionals know all this and have other reasons to spend the money. I submit Microsoft may not "win the AI race" - simply because there isn't any AI to win - but LLMs can still pose an existential threat to Google search. Meanwhile NVidea shareholders aren't going to open up about the hype train they are riding. Nor is new billionaire Sam Altman.

But the money being spent on LLMs can't be justified for long. When viewed by what it is - and isn't - the current costs of LLM technology simply dwarf the financial gains generated. I don't know when the hype train falls off its cliff, but fall it will. And with the whole world's economy having been impacted these past 24 months, the hype train cliff is likely to be a very, very steep one.

By @mikequinlan - 6 months

The current ChatGPT gets it right. Apple's team seems to be wrong.

Prompt:

See if you can solve this arithmetic problem:

Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

Answer: Let’s break it down step by step:

1. Friday: Oliver picks 44 kiwis.

2. Saturday: He picks 58 kiwis.

3. Sunday: He picks double the number he picked on Friday, so: 44 * 2 = 88 kiwis

4. Total number of kiwis picked: 44 + 58 + 88 = 190

The fact that five kiwis were smaller doesn’t affect the total count, so Oliver has 190 kiwis in total.

By @necovek - 6 months

This an extension of an even simpler prompt I used to demonstrate the same: "what day was yesterday if yesterday was a Thursday?"

Latest versions have picked up on that, but simply because this has now entered their training data.

Coming up with novel "trick" questions it has not seen in the training data will still confuse them due to their inherent architecture.

By @mgh2 - 6 months

> "Grady Booch, father UML, has been saying this for years. LLMs aren't intelligent and never will be, though they may get large and complex enough to simulate it. The problem really isn't the amount of data you feed it, it's the foundational architecture. LLMs are based on probability, not logic and understanding."

Apple researchers ran an AI test that exposed a fundamental 'intelligence' flaw

Related

Everyone Is Judging AI by These Tests. Experts Say They're Close to Meaningless

The Sobering Reality of AI: A Researcher's Perspective

Apple study proves LLM-based AI models are flawed because they cannot reason

Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities

LLMs can't perform "genuine logical reasoning," Apple researchers suggest

Related

Everyone Is Judging AI by These Tests. Experts Say They're Close to Meaningless

The Sobering Reality of AI: A Researcher's Perspective

Apple study proves LLM-based AI models are flawed because they cannot reason

Apple Study Reveals Critical Flaws in AI's Logical Reasoning Abilities

LLMs can't perform "genuine logical reasoning," Apple researchers suggest