Popular AI model performance benchmark may be flawed, Meta researchers warn


'We've identified multiple loopholes with SWE-bench Verified,' the manager at Meta Platforms' AI research lab Fair says. — SCMP

A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers warned, raising fresh questions on the veracity of evaluations that have been made on major AI systems.

“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.

The post from Fair, which stands for Fundamental AI Research, found several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.

OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models based on how these systems fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.

Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.

The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.

“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.

Alibaba Cloud’s Qwen is the world’s largest open-source AI ecosystem. Photo: Shutterstock

The revelation made by Meta’s Fair reflected how the shortcomings and limitations of widely used third-party benchmarks have come under increased scrutiny, as AI models have become more sophisticated.

New AI models, for example, could be trained to score high on prominent benchmarks by simply regurgitating training data – often referred to as “data leakage” – leading to so-called benchmark saturation in which a model’s benchmark score would describe little about its overall usefulness. Another issue is called “reward hacking”, in which models exploit loopholes in evaluation methodologies to achieve higher scores.

Efforts are also now under way to plug the loophole in SWE-bench Verified, according to Princeton University’s Carlos Jimenez, one of the researchers behind the benchmark.

“We’re debugging some outstanding issues ... and will push them as soon as possible,” he wrote in a comment under the original GitHub post of Kahn from Meta’s Fair.

Meanwhile, a growing number of AI researchers, including those in China, have initiated development of more comprehensive and useful benchmark systems to accurately evaluate the capabilities of AI models.

In late July, researchers from the Shanghai University of Finance and Economics and Fudan University established a benchmark for evaluating AI agents in the financial industry, with an emphasis on testing their practical usefulness in daily workflows. AI agents are software programs that are capable of autonomously performing tasks on behalf of a user or another system.

In May, Chinese venture capital firm HongShan Capital Group launched a so-called evergreen benchmark for AI agents called Xbench, a set of real-world tasks that is regularly updated to avoid issues surrounding saturation. – South China Morning Post

Follow us on our official WhatsApp channel for breaking news alerts and key updates!

Next In Tech News

Is social media harmful for kids? Meta and YouTube face US trial after TikTok settles suit
It’s not a product. This habit will be the biggest luxury of 2026
Apple spent years downplaying AI chatbots. Now Siri Is becoming one
US judge signals Musk's xAI may lose lawsuit accusing Altman's OpenAI of stealing trade secrets
Apple stole our revolutionary camera technology, British company claims in US district court lawsuit
Exclusive-Saks ending e-commerce partnership with Amazon, source says
Nvidia's plan to invest up to $100 billion in OpenAI has stalled, WSJ reports
Musk's Starlink updates privacy policy to allow consumer data to train AI
Google defeats bid for billions of dollars of new penalties in US privacy class action
Analysis-Combining SpaceX with xAI may be simple for Musk Inc, but Tesla isn't so easy

Others Also Read