September 3rd, 2024

AI worse than humans in every way at summarising information, trial finds

A trial by ASIC found AI less effective than humans in summarizing documents, with human summaries scoring 81% compared to AI's 47%. AI often missed context and included irrelevant information.


A recent trial conducted by Amazon for Australia's corporate regulator, the Australian Securities and Investments Commission (ASIC), found that artificial intelligence (AI) is less effective than humans at summarizing documents. The trial compared summaries generated by the AI model Llama2-70B with those produced by ten ASIC staff members. Reviewers assessed the summaries on criteria such as coherency, length, and the ability to identify relevant references. Human summaries scored 81% on the evaluation rubric, while AI summaries achieved only 47%. Reviewers noted that the AI often failed to capture nuance and context, sometimes including incorrect or irrelevant information, and expressed concern that AI-generated summaries could create additional work because they need to be fact-checked against the original documents. Although the report acknowledged that advances in AI technology could improve summarization in the future, it emphasized that human critical analysis remains superior. The findings suggest that AI should be viewed as a tool to assist, rather than replace, human effort in summarization tasks.

- A government trial found AI to be less effective than humans in summarizing documents.

- Human summaries scored significantly higher than AI summaries in various evaluation criteria.

- Reviewers noted that AI often missed context and included irrelevant information.

- The trial highlighted the potential for AI to create additional work due to fact-checking needs.

- Future advancements in AI may improve summarization, but human analysis is currently unmatched.

9 comments
By @District5524 - 8 months
Summarization is one of those key functionalities of LLMs that laypeople can also easily understand and relate to. I think this article also underpins the hunch that results in ROUGE and similar NLP benchmarks are not necessarily a guarantee for good performance in sector-specific summarization tasks, where the expectations are defined by human users for a specific domain. It reminds me of the Stanford study on the use of LLMs for legal QA including the eye-wateringly expensive legal-specific LLMs: https://hai.stanford.edu/news/ai-trial-legal-models-hallucin...
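To make the point about ROUGE concrete: ROUGE-1 recall is just clipped unigram overlap against a reference, which is exactly why it can miss domain-specific failures. The sketch below (my own toy illustration, not from the article or the study; the example sentences are invented) shows two candidate summaries receiving the same score even though one inverts the reference's meaning.

```python
from collections import Counter

def rouge1_recall(reference: str, candidate: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams that appear in the
    candidate, with per-token counts clipped to the candidate's counts."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum(min(n, cand[tok]) for tok, n in ref.items())
    return overlap / max(sum(ref.values()), 1)

ref = "the regulator found the fund breached its licence conditions"
good = "the regulator found the fund breached licence conditions"
bad = "the regulator found the fund met its licence conditions"

# Both candidates share 8 of the reference's 9 unigrams, so both score ~0.89,
# even though the second one reverses the key finding.
print(rouge1_recall(ref, good))
print(rouge1_recall(ref, bad))
```

A human domain reviewer would immediately reject the second summary; an n-gram metric cannot, which is consistent with benchmark scores diverging from the rubric-based human evaluation described in the article.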
By @pwatsonwailes - 8 months
I find it weird anyone is surprised by this. LLMs don't understand what they're doing at a fundamental level, so when you ask for a summary, you're asking for it to make something more concise. Which it will proceed to do, with no knowledge of the relative importance of what it's pruning out.

There's a whole category of issues around this that I don't see how the current formulation of AI based on LLMs can solve.

By @brrrrrm - 8 months
> The most promising model, Meta’s open source model Llama2-70B

This is an old model that was not dominant even when released. This study must be fairly old; otherwise I'd question the qualifications of the group running it.

By @qgin - 8 months
Currently worse than trained, experienced, intelligent humans with domain knowledge? Sure.

Better than an average human? Absolutely.

By @xnx - 8 months
> every way

So not cheaper, faster, or more consistent?

Sounds more like "worse than humans in the few ways measured in this limited trial"

By @OutOfHere - 8 months
What would actually help is a plot that shows how the versions of Llama models are getting better (or not) at this summarization task, and then attempt to estimate when they will reach a human standard. This would allow us to see things in perspective.
By @rsynnott - 8 months
This feels a _bit_ like 'water is wet, study finds'. Like, I'm a little surprised that they felt the need to run the trial; even the most overconfident AI booster would probably have been reluctant to tell them they'd find anything different.