July 18th, 2024

Everyone Is Judging AI by These Tests. Experts Say They're Close to Meaningless

Benchmarks used to assess AI models may mislead and offer little real insight. Google's and Meta's boasts about their AI models are criticized as resting on outdated, unreliable tests. Experts urge more rigorous evaluation methods amid broader concerns about AI's implications.

The article discusses how the benchmarks used to evaluate AI models can be misleading and offer little meaningful insight into the capabilities of AI products. Companies like Google and Meta often boast about their models' performance on these tests, but experts argue that the benchmarks are outdated, sourced from amateur websites, and do not assess crucial qualities such as the ability to give reliable answers or avoid stating false information.

Researchers question the quality and relevance of these benchmarks, especially when they are applied to high-stakes areas like healthcare and law. Despite the benchmarks' popularity in the AI industry, experts call for more rigorous and accurate evaluation methods. The piece also touches on the broader implications of AI technology and the increasing scrutiny it faces from policymakers. Researchers caution against trusting AI models on the strength of benchmark scores alone, warning that those scores may not reflect a model's actual understanding or reasoning abilities, and they underscore the importance of transparency and responsible use of AI in fields where the stakes are high.

6 comments
By @sho - 3 months
I paid to upgrade my Anthropic account to pro today, ending a long monogamy with OpenAI, and one thing that struck me was how hard it was to describe the advantages of one over the other. I like claude.ai's "style" more, and prefer the "interface" - basically the "way" they talk over the strict correctness of what they say.

Hate on LLM-AIs but if you told me 5 years ago I'd be switching my AI provider because I liked another one's style better, I'd have thought you were bonkers. Shit's come a long way.

By @sigmoid10 - 3 months
This actually speaks volumes about the progress AI has made in recent years. Its capabilities have become so intangible that people are beginning to attack standardised testing - just like they do for humans. Because we all know that a human who does well on a test might still suck at the real job, and vice-versa. If we get to the point where we need to do personal interviews with new models to see if they could be used for a certain job, a lot of people will get the denial rug pulled out from under them pretty hard.
By @viraptor - 3 months
Related: "let's make leaderboards steep again" https://huggingface.co/spaces/open-llm-leaderboard/blog
By @tkgally - 3 months
I currently use ChatGPT 4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro for various personal and work-related tasks, and I often compare them on the same task. It’s really hard to decide which performs best overall for me. Sometimes one will be obviously better or worse for a relatively trivial reason, such as a longer context window or overeager censorship (looking at you for both, Gemini!). But usually, especially for the extended back-and-forth interactions that I find most useful, I am unable to state objectively which model is better.
By @cowboylowrez - 3 months
I'd like to see a test like "chance of hallucination per prompt." Obviously, this test rating is LLM specific, because if you were rating a human's ability during an interview, for example, once you detected a hallucination you'd politely end the interview and lock your door once the interviewee left the premises.
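
A metric like that is only as good as the judge deciding what counts as a hallucination, which is the same evaluation problem the article raises. As a rough illustration only, a per-prompt hallucination rate could be computed along the lines of the sketch below, where query_model and is_hallucinated are hypothetical stand-ins for the model under test and for whatever human or automated fact-checker flags unsupported claims.

```python
# Hypothetical sketch of a "chance of hallucination per prompt" metric.
# query_model and is_hallucinated are stand-ins, not a real API: in practice
# the hard part is deciding, per response, whether it asserts claims that are
# absent from or contradicted by a trusted reference answer.

from typing import Callable


def hallucination_rate(
    prompts: list[str],
    references: list[str],
    query_model: Callable[[str], str],
    is_hallucinated: Callable[[str, str], bool],
) -> float:
    """Return the fraction of prompts whose response was flagged as hallucinated."""
    if len(prompts) != len(references):
        raise ValueError("each prompt needs a reference answer")
    flagged = 0
    for prompt, reference in zip(prompts, references):
        response = query_model(prompt)          # ask the model under test
        if is_hallucinated(response, reference):  # judge against the reference
            flagged += 1
    return flagged / len(prompts)
```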
By @Nasrudith - 3 months
The AIs are doing some ironic highlighting in reverse: many of the ways we benchmark humans are actually kind of sucky.