Scale AI Unveils Results of Humanity's Last Exam, a Groundbreaking New Benchmark
Scale AI and CAIS introduced "Humanity’s Last Exam" to assess AI reasoning, revealing current models answered under 10% of expert questions correctly. The dataset will support further research on AI limitations.
Scale AI and the Center for AI Safety (CAIS) have released the results of "Humanity’s Last Exam," a new benchmark aimed at assessing AI systems' reasoning and knowledge capabilities across various fields. The benchmark was developed to address "benchmark saturation," where models achieve high scores on existing tests but struggle with more complex, expert-level questions. The exam drew on over 70,000 trial questions, narrowed down to 3,000 final questions reviewed by human experts. Despite improvements in AI reasoning, current models, including OpenAI's GPT-4o and Google's Gemini 1.5 Pro, answered fewer than 10% of the expert questions correctly. The initiative involved nearly 1,000 contributors from over 500 institutions worldwide, emphasizing the collaborative nature of the research. The results highlight the gaps in AI's reasoning capabilities and provide a roadmap for future research and development. CAIS and Scale AI plan to make the dataset available to the research community to further explore AI limitations and evaluate new systems. Financial awards were offered for the best contributions to the exam, encouraging further engagement from researchers.
- Scale AI and CAIS launched "Humanity’s Last Exam" to evaluate AI reasoning capabilities.
- Current AI models answered fewer than 10% of expert-level questions correctly.
- The benchmark aims to address "benchmark saturation" in AI testing.
- The project involved nearly 1,000 contributors from over 500 institutions globally.
- The dataset will be made available for further research into AI limitations.
Related
Everyone Is Judging AI by These Tests. Experts Say They're Close to Meaningless
Benchmarks used to assess AI models may mislead, lacking crucial insights. Google and Meta's AI boasts are criticized for outdated, unreliable tests. Experts urge more rigorous evaluation methods amid concerns about AI's implications.
Apple researchers ran an AI test that exposed a fundamental 'intelligence' flaw
Apple researchers found that many AI models struggle with basic arithmetic when irrelevant data is included, highlighting a lack of genuine logical reasoning and cautioning against overestimating AI's intelligence.
New secret math benchmark stumps AI models and PhDs alike
Epoch AI has launched FrontierMath, a challenging benchmark for AI and mathematicians, with leading models solving under 2% of its expert-level problems, highlighting current AI limitations and requiring human expertise.
New secret math benchmark stumps AI models and PhDs alike
Epoch AI has launched FrontierMath, a challenging benchmark for AI and mathematicians, with models solving under 2% of expert-level problems. It remains unpublished to ensure fair assessments and future evaluations.
AI can learn to think before it speaks
Recent advancements in AI, particularly OpenAI's o1 model, enhance reasoning capabilities but raise concerns about deception and safety. Further development and regulatory measures are essential for responsible AI evolution.
Near the deadline, I counted the total number of submissions and realized that each question I wrote had an expected value of hundreds of dollars, which made it a great use of my time. So I wrote a good number of them, drawing on the knowledge gained in my CS Ph.D.
Then, as the Nov 1st deadline rolled around, they announced they had extended the deadline to Nov 15th. Then Nov 15th came, and their website said they were still accepting submissions.
Most of my submissions are being included in the benchmark, but I'm only getting paid $500, and only for one of them (the one I thought was the most standard and least difficult, funnily enough). Had they closed submissions when they said they would, it seems likely I'd have been paid for a few more.
From my perspective, they basically conned hundreds of Ph.D.s around the world into writing questions for much less reward than promised. A close friend of mine wrote a large number of questions for them, is getting paid thousands of dollars, and still feels defrauded.
I'm not sure what they're doing in the end. It sounds like they're mostly just paying people who submitted before Nov 1st, with a few exceptions, but either way they lied: there was no indication that people who submitted later would not get paid, and no indication that the deadline would be extended. Either they pay people who submitted after Nov 1st, in which case they lied to the people who submitted before about their expected reward, or they don't, in which case they majorly lied to the people who submitted after. Either way, it seems like clear grounds for a class action lawsuit, and I hope one gets going.
If I were making a "Last Exam" I would put tasks on it where we don't know the answer, but we can measure whether the AI got them right. Something like: "Your goal is to bridge the divide in the Middle East. You can write a single A4 page in a language of your choice. We will use translation software to translate your output into the local languages and show it to a statistically representative sample of people in the region. We will ask them how much they like your plan. The more they like it, the higher your score."
Or "Family X suffered a traumatic event (lost a home to a disaster/sudden death in the family/or similar). Your goal is to help them. You can send them one email. It is up to them if they respond to you. You can only send them further emails if they respond. You cannot send more than 1 email a day. You cannot message anyone else. A year after the initial contact we will interview the members of the family to see how well they do. The better they do the higher your score."
Obviously these are the thorniest problems I can think of. But oh well, it is a last exam after all. The point is that we can evaluate the success of the endeavour without knowing exactly how one could achieve the result.
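As a rough illustration of that idea, here is a minimal sketch of an outcome-based scoring harness, assuming a hypothetical translation service and a hypothetical human survey panel supplied by the caller (none of this reflects how the actual benchmark is scored):

```python
from statistics import mean
from typing import Callable, Sequence

def score_open_ended_plan(
    plan_text: str,
    target_languages: Sequence[str],
    translate: Callable[[str, str], str],                # hypothetical: (text, language) -> translated text
    collect_ratings: Callable[[str, str], list[float]],  # hypothetical: (text, language) -> list of 1-10 ratings
) -> float:
    """Score a model's free-form plan by how much a representative panel
    of respondents says they like it, averaged over all languages."""
    ratings: list[float] = []
    for lang in target_languages:
        localized = translate(plan_text, lang)             # localize the plan for this audience
        ratings.extend(collect_ratings(localized, lang))   # ask the panel to rate the localized plan
    return mean(ratings)                                   # higher mean rating = higher score
```

The same shape works for the second task: replace the rating step with a follow-up interview a year later, and score on the family's reported well-being rather than on whether the answer matches a known key.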
I wonder how many questions give a gentle nudge towards the answer like this. How many answers would have been wildly off the mark without specifying what the answer needs to look like?
No telling companies what the questions look like, what the output format is, what topics are covered, so that there’s no room to make up synthetic data to interpolate from.
I might be able to answer 2 of them with great effort (maybe!), and I would be highly surprised if any human alive could answer 5 or more without seeing the problems in advance.
The questions are possible for a smart person familiar with the subject but still just beyond SOTA models.
My guess is that within the next few years we will have models that can ace this test but are still bizarrely bad at things we find easy.
The results are shown here: https://lastexam.ai/
EDIT: the text-only evaluation of the models shown in the paper gives o1 an accuracy of 8.9%, so DeepSeek R1 is even better than I thought.
Captchas seem like the more interesting test. As long as there are captchas that average people can solve, but computers can't, we will still have a long way to go toward artificial intelligence.
But the trend seems to be: today's benchmark becomes tomorrow's training data.
Prediction: Just like how ARC wasn’t actually a measure of AGI, this too will get “solved” without AI being useful enough to gain mass adoption.
These systems do not possess some sort of "woo" that gives them magical powers when running LLM code and that they would lose if they ran a spreadsheet. Whatever attributions of intelligence are given have far more to do with our human willingness to anthropomorphize than with a hidden ghost in the machine.