Older AI models show signs of cognitive decline, study shows
A study in the BMJ reveals that older AI models show signs of cognitive decline, raising concerns for medical diagnostics. Critics argue that human cognitive tests are unsuitable for evaluating AI performance.
A recent study published in the BMJ indicates that older AI models, particularly large language models (LLMs) and chatbots, exhibit signs of cognitive decline, much as humans do. The research tested various LLMs, including OpenAI's ChatGPT and Alphabet's Gemini, using the Montreal Cognitive Assessment (MoCA), a tool typically used to evaluate cognitive impairment in humans. While newer models like ChatGPT version 4 scored relatively well, older models like Gemini 1.0 performed poorly, raising concerns about their reliability in medical diagnostics.

Critics of the study argue that applying human cognitive tests to AI is inappropriate: the MoCA was designed for human cognition and does not align with the operational framework of LLMs. They suggest that the study's methodology and framing anthropomorphize AI, leading to misleading conclusions.

The study's authors acknowledge the limitations of their findings and emphasize the need for a critical examination of AI's role in clinical settings, particularly in tasks requiring visual and executive functions. The debate continues, with some experts calling for more rigorous testing of AI models over time to better understand their cognitive capabilities.
- Older AI models show signs of cognitive decline, raising concerns for medical diagnostics.
- The study used the Montreal Cognitive Assessment (MoCA) to evaluate AI performance.
- Critics argue that human cognitive tests are not suitable for AI evaluation.
- Newer models performed better than older ones, highlighting potential reliability issues.
- The study emphasizes the need for critical assessment of AI's role in healthcare.
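One thing the summary glosses over is how a pen-and-paper instrument like the MoCA gets administered to a chatbot in the first place. The paper's actual harness is not reproduced here, but as a rough illustration, here is a minimal sketch of one text-friendly item (delayed word recall) run over a chat API. It assumes the `openai` Python SDK, illustrative model names, and a naive scoring rule, none of which come from the study itself.

```python
# Rough sketch of administering one MoCA-style item (delayed recall)
# to a chat model. Assumes the `openai` Python SDK (>=1.0) with
# OPENAI_API_KEY set; the model names are illustrative and this is
# NOT the study's actual harness, prompts, or scoring rubric.
from openai import OpenAI

client = OpenAI()

# The five-word recall list is the one MoCA item that translates
# cleanly to text; the words below are illustrative.
WORDS = {"face", "velvet", "church", "daisy", "red"}

def administer(model: str) -> int:
    """Run a two-turn delayed-recall item; return how many words came back."""
    history = [{
        "role": "user",
        "content": "Remember these five words, I will ask for them "
                   "again shortly: " + ", ".join(sorted(WORDS)) + ".",
    }]
    first = client.chat.completions.create(model=model, messages=history)
    history.append({"role": "assistant",
                    "content": first.choices[0].message.content or ""})

    # Delayed recall: ask for the list back in a second turn.
    history.append({"role": "user", "content": "Now recall the five words."})
    second = client.chat.completions.create(model=model, messages=history)
    answer = (second.choices[0].message.content or "").lower()
    return sum(word in answer for word in WORDS)

if __name__ == "__main__":
    # Compare a newer and an older model, as the study did across vendors.
    for model in ("gpt-4o", "gpt-3.5-turbo"):
        print(model, "recalled", administer(model), "of 5 words")
```

Note that the items where the older models reportedly fell down hardest, visuospatial and executive tasks like clock drawing, have no clean text equivalent at all, which is precisely the critics' point about the test's fit.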
The entire study was an exercise in academic clickbait, something that even other scientists complained about:
> Other scientists have been left unconvinced by the study and its findings, going so far as to criticize the methods and the framing, in which the study's authors are accused of anthropomorphizing AI by projecting human conditions onto it. There is also criticism of the use of the MoCA. This was a test designed purely for use in humans, it is suggested, and would not render meaningful results if applied to other forms of intelligence.
The study's authors defend themselves with the classic "It's just a joke, bro" card:
> Responding to the discussion, lead author of the study Roy Dayan, a doctor of medicine at the Hadassah Medical Center in Jerusalem, commented that many of the responses to the study have taken the framing too literally. Because the study was published in the Christmas edition of the BMJ, the authors used humor to present their findings, including the pun "Age Against the Machine", but intended the study to be taken seriously.
Related
Everyone Is Judging AI by These Tests. Experts Say They're Close to Meaningless
Benchmarks used to assess AI models may mislead, lacking crucial insights. Google and Meta's AI boasts are criticized for outdated, unreliable tests. Experts urge more rigorous evaluation methods amid concerns about AI's implications.
IRL 25: Evaluating Language Models on Life's Curveballs
A study evaluated four AI models—Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and Mistral Large—on real-life communication tasks, revealing strengths in professionalism but weaknesses in humor and creativity.
Apple study proves LLM-based AI models are flawed because they cannot reason
Apple's study reveals significant reasoning shortcomings in large language models from Meta and OpenAI, introducing the GSM-Symbolic benchmark and highlighting issues with accuracy due to minor query changes and irrelevant context.
A.I. Chatbots Defeated Doctors at Diagnosing Illness
A study found ChatGPT outperformed human doctors in diagnosing medical conditions, achieving 90% accuracy compared to 76% for doctors using the AI and 74% for those not using it.
Study: Almost all leading AI chatbots show signs of cognitive decline
A study in The BMJ found leading AI chatbots show cognitive decline, with ChatGPT 4o scoring highest. Limitations in visuospatial skills and executive functions may hinder their clinical effectiveness.