Opinion: When AI passes this test, look out

Hendrycks worked with Scale AI, an AI company where he is an adviser, to compile the test, which consists of roughly 3,000 multiple-choice and short answer questions designed to test AI systems’ abilities in areas including analytic philosophy and rocket engineering. — ©2025 The New York Times Company

SAN FRANCISCO: If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that AI systems can’t pass.

For years, AI systems were measured by giving new models a variety of standardised benchmark tests. Many of these tests consisted of challenging, SAT-caliber problems in areas like math, science and logic. Comparing the models’ scores over time served as a rough measure of AI progress.

But AI systems eventually got too good at those tests, so new, harder tests were created – often with the types of questions graduate students might encounter on their exams.

Those tests aren’t in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many doctorate-level challenges, limiting those tests’ usefulness and leading to a chilling question: Are AI systems getting too smart for us to measure?

This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: A new evaluation, called “Humanity’s Last Exam”, that they claim is the hardest test ever administered to AI systems.

Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and director of the Center for AI Safety. (The test’s original name, “Humanity’s Last Stand”, was discarded for being overly dramatic.)

Questions were submitted by experts in the fields the exam covers, including college professors and prizewinning mathematicians, who were asked to come up with extremely difficult questions they knew the answers to.

The creators of a new test, dubbed ‘Humanity’s Last Exam’, argue we may soon lose the ability to create tests hard enough for AI models. — ©2025 The New York Times Company

Here, try your hand at a question about hummingbird anatomy from the test:

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

Or, if physics is more your speed, try this one:

A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1−T2)/W?

(I would print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I’m far too dumb to verify the answers myself.)

The questions on Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve.

If the models couldn’t answer them (or if, in the case of multiple-choice questions, the models did worse than they would have by random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote top-rated questions were paid between US$500 and US$5,000 (RM2,194 and RM21,937) per question, as well as receiving credit for contributing to the exam.
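For readers who think in code, the two-step filter described above can be sketched roughly as follows. This is an illustration, not the researchers' actual pipeline: the `Question` class, the `ask_models` and `human_review` callables, and the reading of "worse than random guessing" as "the share of models answering correctly is below one over the number of choices" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    answer: str
    choices: tuple = ()  # empty for short-answer questions


def stumps_models(question, model_answers):
    """Step one: keep a question only if the models fail on it.

    Assumes `model_answers` is a non-empty list, one answer per model.
    """
    correct = sum(a == question.answer for a in model_answers)
    if question.choices:
        # Multiple choice (assumed reading): the share of models that
        # answer correctly must fall below the random-guessing rate.
        chance = 1 / len(question.choices)
        return correct / len(model_answers) < chance
    # Short answer: no model may get it right.
    return correct == 0


def filter_questions(submissions, ask_models, human_review):
    """Run both steps. `ask_models` and `human_review` are hypothetical
    callables standing in for the model runs and the expert reviewers."""
    survivors = []
    for q in submissions:
        if stumps_models(q, ask_models(q)):
            reviewed = human_review(q)  # step two: refine and verify
            if reviewed is not None:    # reviewers may reject a question
                survivors.append(reviewed)
    return survivors
```

In this sketch, a question that any model answers correctly never reaches the human reviewers, which mirrors the order of the two steps as the article describes them.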

Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of his questions were chosen, all of which he told me were “along the upper range of what one might see in a graduate exam”.

Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder AI tests by a conversation with Elon Musk. (Hendrycks is also a safety adviser to Musk’s AI company, xAI.) Musk, he said, raised concerns about the existing tests given to AI models, which he thought were too easy.

“Elon looked at the MMLU questions and said, ‘These are undergrad level. I want things that a world-class expert could do’,” Hendrycks said.

There are other tests trying to measure advanced AI capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by AI researcher François Chollet.

But Humanity’s Last Exam is aimed at determining how good AI systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score.

“We are trying to estimate the extent to which AI can automate a lot of really difficult intellectual labor,” Hendrycks said.

Once the list of questions had been compiled, the researchers gave Humanity’s Last Exam to six leading AI models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored the highest of the bunch, with a score of 8.3%.

(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement of news content related to AI systems. OpenAI and Microsoft have denied those claims.)

Hendrycks said he expected those scores to rise quickly, and potentially to surpass 50% by the end of the year. At that point, he said, AI systems might be considered “world-class oracles,” capable of answering questions on any topic more accurately than human experts. And we might have to look for other ways to measure AI’s impacts, like looking at economic data or judging whether it can make novel discoveries in areas like math and science.

“You can imagine a better version of this where we can give questions that we don’t know the answers to yet, and we’re able to verify if the model is able to help solve it for us,” said Summer Yue, Scale AI’s director of research and an organiser of the exam.

Part of what’s so confusing about AI progress these days is how jagged it is. We have AI models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Math Olympiad and beating top human programmers on competitive coding challenges.

But these same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry. That has given them a reputation as astoundingly brilliant at some things and totally useless at others, and it has created vastly different impressions of how fast AI is improving, depending on whether you’re looking at the best or the worst outputs.

That jaggedness has also made measuring these models hard. I wrote last year that we need better evaluations for AI systems. I still believe that. But I also believe that we need more creative methods of tracking AI progress that don’t rely on standardised tests, because most of what humans do – and what we fear AI will do better than us – can’t be captured on a written exam.

Zhou, the theoretical particle physics researcher who submitted questions to Humanity’s Last Exam, told me that while AI models were often impressive at answering complex questions, he didn’t consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers.

“There’s a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an AI that can answer these questions might not be ready to help in research, which is inherently less structured.” – ©2025 The New York Times Company
