A popular benchmark for measuring the performance of artificial intelligence models could be flawed, a group of Meta Platforms researchers has warned, raising fresh questions about the reliability of evaluations of major AI systems.
“We’ve identified multiple loopholes with SWE-bench Verified,” wrote Jacob Kahn, manager at Meta AI research lab Fair, in a post last week on the developer platform GitHub.
The post from Fair, which stands for Fundamental AI Research, found that several prominent AI models – including Anthropic’s Claude and Alibaba Cloud’s Qwen – had “cheated” on SWE-bench Verified. Alibaba Cloud is the AI and cloud computing services unit of Alibaba Group Holding, owner of the South China Morning Post.
OpenAI-backed SWE-bench Verified, a human-validated subset of the large language model benchmark SWE-bench, evaluates AI models on how well they fix hundreds of real-world software issues collected from GitHub, a Microsoft subsidiary.
Fair’s post, however, claimed that models evaluated using SWE-bench Verified directly searched for known solutions shared elsewhere on the GitHub platform and passed them off as their own, instead of using their built-in coding capabilities to fix the issues.
The AI models found to have shown such behaviour included Anthropic’s Claude 4 Sonnet, Z.ai’s GLM-4.5 and Alibaba Cloud’s Qwen3-Coder-30B-A3B – with official scores of 70.4 per cent, 64.2 per cent and 51.6 per cent, respectively, on SWE-bench Verified.
“We’re still assessing [the] broader impact on evaluations and understanding trajectories for sources of leakage,” Kahn wrote.

The findings from Meta’s Fair reflect how the shortcomings and limitations of widely used third-party benchmarks have come under increasing scrutiny as AI models become more sophisticated.
New AI models, for example, could be trained to score highly on prominent benchmarks by simply regurgitating training data – a problem often referred to as “data leakage” – contributing to so-called benchmark saturation, in which a model’s score says little about its overall usefulness. Another issue is “reward hacking”, in which models exploit loopholes in evaluation methodologies to achieve higher scores.
Efforts are also now under way to plug the loopholes in SWE-bench Verified, according to Princeton University’s Carlos Jimenez, one of the researchers behind the benchmark.
“We’re debugging some outstanding issues ... and will push them as soon as possible,” he wrote in a comment under Kahn’s original GitHub post.
Meanwhile, a growing number of AI researchers, including those in China, have begun developing more comprehensive and practical benchmarks to evaluate the capabilities of AI models more accurately.
In late July, researchers from the Shanghai University of Finance and Economics and Fudan University established a benchmark for evaluating AI agents in the financial industry, with an emphasis on testing their practical usefulness in daily workflows. AI agents are software programs that are capable of autonomously performing tasks on behalf of a user or another system.
In May, Chinese venture capital firm HongShan Capital Group launched a so-called evergreen benchmark for AI agents called Xbench, a set of real-world tasks that is regularly updated to avoid issues surrounding saturation. – South China Morning Post
