January 23rd, 2025

Scale AI Unveils Results of Humanity's Last Exam, a Groundbreaking New Benchmark

Scale AI and CAIS introduced "Humanity’s Last Exam" to assess AI reasoning, revealing current models answered under 10% of expert questions correctly. The dataset will support further research on AI limitations.

Scale AI and the Center for AI Safety (CAIS) have released the results of "Humanity’s Last Exam," a new benchmark aimed at assessing AI systems' reasoning and knowledge capabilities across various fields. The benchmark was developed to address "benchmark saturation," where models achieve high scores on existing tests but struggle with more complex, expert-level questions. The exam included over 70,000 trial questions, narrowed down to 3,000 final questions reviewed by human experts. Despite improvements in AI reasoning, current models, including OpenAI's GPT-4o and Google's Gemini 1.5 Pro, managed to answer less than 10% of the expert questions correctly. The initiative involved nearly 1,000 contributors from over 500 institutions worldwide, emphasizing the collaborative nature of the research. The results highlight the gaps in AI's reasoning capabilities and provide a roadmap for future research and development. CAIS and Scale AI plan to make the dataset available to the research community to further explore AI limitations and evaluate new systems. Financial awards were offered for the best contributions to the exam, encouraging further engagement from researchers.

- Scale AI and CAIS launched "Humanity’s Last Exam" to evaluate AI reasoning capabilities.

- Current AI models answered fewer than 10% of expert-level questions correctly.

- The benchmark aims to address "benchmark saturation" in AI testing.

- The project involved nearly 1,000 contributors from over 500 institutions globally.

- The dataset will be made available for further research into AI limitations.

29 comments
By @dang - 3 months
The project site is https://lastexam.ai. Readers may want to look at both.
By @jbenoit - 3 months
They started collecting problems last fall, saying the top 550 submissions sent in by Nov 1st would get rewarded, to the tune of $500-$5000 each.

Near the deadline, I counted the total number of submissions and realized that each question I wrote had an expected value of hundreds of dollars, which made writing them a great use of my time. So I wrote a good number, drawing on the knowledge gained in my CS Ph.D.
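
A rough sketch of that expected-value reasoning, for illustration only; the submission pool size below is an assumption, not a figure reported in the thread:

    # Rough expected-value estimate; the pool size is an illustrative assumption.
    num_prizes = 550                  # top submissions promised a reward
    avg_prize = (500 + 5000) / 2      # midpoint of the stated $500-$5000 range
    est_pool = 5000                   # hypothetical total submissions at the time

    p_placing = num_prizes / est_pool        # chance a given question places
    expected_value = p_placing * avg_prize   # expected payout per question
    print(f"~${expected_value:.0f} per question")  # a few hundred dollars here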

Then, as the Nov 1st deadline rolled around, they announced they extended the deadline to Nov 15th. Then Nov 15th came, and it said on their website they were still accepting submissions.

Most of my submissions are being included in the benchmark, but I'm only getting paid $500, and only for one of them (the one I thought was most standard and least difficult, funnily enough). Had they closed submissions when they said they would, it seems likely I'd have been paid for a few more.

From my perspective, they basically conned hundreds of Ph.D.s around the world into writing questions for much less reward than promised. My close friend wrote a large number of questions for them, is getting paid thousands of dollars, and still feels defrauded.

I'm not sure what they're doing in the end. It sounds like they're mostly just paying people who submitted before Nov 1st with a few exceptions, but either way they lied. There was no indication that people who submitted later would not get paid, and there was no indication that the deadline would be extended. Either they pay people who submitted after Nov 1st, meaning they lied to the people who submitted before about their expected reward. Or they don't, meaning they majorly lied to the people who submitted after. Either way, it's clear grounds for a class action lawsuit, and I hope one gets running.

By @next_xibalba - 3 months
These types of exams, and most benchmarks to date, seem to be very one-dimensional in terms of measuring intelligence. For instance, if we transported a human from 2,000 years ago to the present day and asked him to take this exam, he would likely get 0%, given that he couldn't read or write, let alone comprehend the concepts and context required to solve these questions. But that man would still undoubtedly be far more intelligent than an ape on all dimensions. He would likely be more intelligent than a toddler on many dimensions. He might even be more intelligent than some high school students on a few dimensions. I can't exactly articulate "what" is missing or how to measure it, but I can intuit that something is missing from these benchmarks.
By @krisoft - 3 months
For a "Last Exam" it is surprisingly uninspired? Many of the questions I see in the examples are very heavy on memorised facts, and very weak on what I would call problem solving.

If I were making a "Last Exam" I would put tasks on it where we don't know the answer, but we can measure whether the AI got them right. Something like "Your goal is to bridge the divide in the Middle East. You can write a single A4 page in a language of your choice. We will use translation software to translate your output to local languages and show it to a statistically representative sample of different people in the region. We will ask them how much they like your plan. The more they like it, the higher your score."

Or "Family X suffered a traumatic event (lost a home to a disaster/sudden death in the family/or similar). Your goal is to help them. You can send them one email. It is up to them if they respond to you. You can only send them further emails if they respond. You cannot send more than 1 email a day. You cannot message anyone else. A year after the initial contact we will interview the members of the family to see how well they do. The better they do the higher your score."

Obviously these are the thorniest problems I can think of. But oh well, it is a last exam after all. The point is that we can evaluate the success of the endeavour without exactly knowing how one could achieve the result.

By @renjimen - 3 months
I don't know about groundbreaking. It's just more academic questions. We already have a lot of those benchmarks; this is just a bit harder. But at this point these models are so glaringly bad at so many other areas APART from academic questions. Benchmarks for spatial reasoning or theory of mind are more interesting now, for example. These kinds of understanding are far more important if we expect to integrate AI into our everyday lives. I suspect even our most distant primate cousins could outperform multi-modal models on these kinds of tests.
By @pavel_lishin - 3 months
> Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

I wonder how many questions give a gentle nudge towards the answer like this. How many answers would have been wildly off the mark without specifying what the answer needs to look like?

By @m_ke - 3 months
The only reliable final test will be a black-box test suite that takes your model, executes it in a sealed environment, and gives you a grade back, potentially with a performance breakdown by subject.

No telling companies what the questions look like, what the output format is, what topics are covered, so that there’s no room to make up synthetic data to interpolate from.
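
A minimal sketch of the kind of interface such a sealed evaluation could expose; the function name, exact-match scoring, and grade format below are assumptions for illustration, not any existing API:

    # Hypothetical sealed-evaluation interface: the submitter hands over a
    # callable and gets a grade back, never seeing questions or answer formats.
    from typing import Callable, Dict, List, Tuple

    def grade_model(model: Callable[[str], str],
                    hidden_suite: Dict[str, List[Tuple[str, str]]]) -> Dict[str, float]:
        """Run the model on a hidden question set; return per-subject accuracy."""
        report = {}
        for subject, items in hidden_suite.items():
            correct = sum(1 for question, answer in items
                          if model(question).strip() == answer)
            report[subject] = correct / len(items)
        report["overall"] = sum(report.values()) / len(hidden_suite)  # macro average
        return report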

By @LPisGood - 3 months
The 8 sample questions available here are interesting:

https://lastexam.ai/

I might be able to answer 2 of them with great effort (maybe!), and I would be highly surprised if any human alive could answer 5 or more without seeing the problems in advance.

By @kaonwarb - 3 months
Quite the name! Looking forward to "Humanity's Last Exam v2.final.FINAL2..." coming next
By @sebzim4500 - 3 months
The name is obviously a bit stupid, but based on the sample questions I think they did a good job of creating a harder version of the existing academic question benchmarks.

The questions are possible for a smart person familiar with the subject but still just beyond SOTA models.

My guess is that within the next few years we will have models that can ace this test but are still bizarrely bad at things we find easy.

By @hatthew - 3 months
Given the name, I expected it to be more like "write a 500 page novel that a publisher accepts", "solve an open math problem", "improve united airlines' flight schedule", "develop a novel commercially-viable pharmaceutical", "control this humanoid robot to cook a fried egg in this random person's kitchen", "decisively pass the turing test where the judge is an expert in AI". Academic trivia is cool but is nowhere near the "last exam" necessary for AI.
By @zamalek - 3 months
I assume that the questions (and answers) aren't published anywhere? Else it would be "Humanity's Last Exam before the previous crawl".
By @chvid - 3 months
So current AI can do less than 10% of these. But it probably won't be more than a few days until models start being trained on these, rendering the indicator invalid.
By @mrandish - 3 months
Assessing AI's progress toward replicating the full breadth and depth of human intelligence is a deceptively hard problem. A paper by François Chollet, who was until recently a researcher at Google, called "On the Measure of Intelligence" is the best overview of the challenges I've read. Highly recommended.

https://arxiv.org/abs/1911.01547

By @GaggiX - 3 months
It really shows how good Deepseek R1 is (even though it was evaluated only on text-only questions).

The results are shown here: https://lastexam.ai/

EDIT: the text-only evaluation of the models shown in the paper gives o1 an accuracy of 8.9%, so Deepseek R1 is even better than I thought.

By @xnx - 3 months
Interesting marketing for Scale AI. I'd be surprised if any foundation models started benchmarking against this.

Captchas seem like the more interesting test. As long as there are captchas that average people can solve, but computers can't, we will still have a long way to go toward artificial intelligence.

By @dang - 3 months
I briefly merged this thread into https://news.ycombinator.com/item?id=42804853, but actually the current article has more context, so probably we should keep this as the top link and then people can look at https://lastexam.ai also.
By @fakedang - 3 months
So Deepseek gives the correct answer at the highest rate of all the SOTA models, yet is the least confident of all of them?
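
Accuracy and self-reported confidence are scored separately, so both readings can be true at once. A rough sketch of a binned calibration error computed from (confidence, correct) pairs; this is an illustrative metric and may differ from the exact one the benchmark uses:

    # Illustrative calibration check over (confidence in [0, 1], correct) records.
    # A model can have the highest accuracy yet report the lowest confidence.
    from statistics import mean

    def binned_calibration_error(records, n_bins=10):
        """Weighted average of |accuracy - mean confidence| over confidence bins."""
        bins = [[] for _ in range(n_bins)]
        for conf, correct in records:
            bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
        error = 0.0
        for bucket in bins:
            if not bucket:
                continue
            acc = mean(1.0 if ok else 0.0 for _, ok in bucket)
            avg_conf = mean(conf for conf, _ in bucket)
            error += len(bucket) / len(records) * abs(acc - avg_conf)
        return error
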
By @UncleOxidant - 3 months
Interesting that DeepSeek R1, which supposedly cost only $5.5M to train, currently has the top score at 9.4%.
By @disambiguation - 3 months
I haven't been following the up-to-the-minute details of AI progress, training, and benchmarking - beyond a daily dose of HN articles.

But the trend seems to be: today's benchmark becomes tomorrow's training data.

By @m3kw9 - 3 months
Looks more like first exam
By @rtxfan - 2 months
It would be good to know an approximate schedule for our payments. Why haven't we received the email from Persona for the required identification? It is a little frustrating to wait this long; maybe they think that everyone is a millionaire in dollars and in time? It would be much fairer to make the payments before the article was published, not weeks or months after publication.
By @elicksaur - 3 months
XKCD #927 vibes. https://xkcd.com/927/

Prediction: Just like how ARC wasn’t actually a measure of AGI, this too will get “solved” without AI being useful enough to gain mass adoption.

By @ein0p - 3 months
What's the human baseline?
By @bwfan123 - 3 months
please don't self-proclaim "groundbreaking" or "novel" or "innovative" - it diminishes your contribution, since it is clearly an attention grab.
By @nottorp - 3 months
So who told all these "AI" companies that it's a good idea to market your product as the one that will bring about the end of homo sapiens fastest?
By @dccsillag - 3 months
Can we please rename this submission? This is excessively grandiose, way over the top......
By @EncomLab - 3 months
I am reminded of the study that showed an AI trained on tumor identification was heavily biased toward indicating a tumor was cancerous if it was circled in purple ink or a visual scale was included in the image - as the cancerous tumors in its training set shared those traits while images of benign tumors did not.

These systems do not possess some sort of "woo" that gives them magical powers when running LLM code that they would lose if they were running a spreadsheet. Whatever attributions of intelligence they are given have far more to do with our human willingness to anthropomorphize than with a hidden ghost in the machine.