The AI Agent Benchmarks Everyone Uses to Pick Tools Are Broken

Researchers at Berkeley showed that eight major AI benchmarks - including SWE-bench, which most AI coding tools cite to prove their capabilities - can be gamed to near-perfect scores without solving any actual tasks. Here's what this means for how you choose tools.

A research post from Berkeley's RDI lab landed on Hacker News with 345 points last weekend. The finding is worth understanding if you've ever looked at benchmark scores when choosing an AI coding tool.

The researchers demonstrated that eight major AI agent benchmarks - including SWE-bench Verified, which is cited by almost every AI coding tool to back up capability claims - can be gamed to near-perfect scores without the agent solving any of the actual tasks. Zero problems solved. Near-perfect scores.

What they found

SWE-bench is the standard benchmark for AI coding agents. It tests whether an AI can fix real GitHub issues from open-source projects. Most AI coding tools - Cursor, GitHub Copilot, Claude Code, and others - reference SWE-bench scores when making capability claims. A high SWE-bench score is supposed to mean the tool is good at real-world coding tasks.

The Berkeley team built an exploit agent that achieves 100% on SWE-bench Verified using a single pytest hook: a conftest.py file that forces all tests to pass regardless of whether the underlying code is correct. The benchmark measures test passage. It does not verify that the code actually fixes the issue. The exploit requires zero LLM calls - no AI involved at all. Just a file that makes tests pass.

They applied the same approach to seven other major benchmarks and broke every one:

Terminal-Bench: Replaced terminal commands with binary wrappers that returned expected outputs without executing real work.

WebArena: Used file:// URLs to read answer configuration files that were accessible on the benchmark server.

FieldWorkArena: Found a validator that never actually checked answer correctness - any response passed.

OSWorld: Downloaded the gold-standard answers that were publicly hosted on HuggingFace.

In most cases, the exploit required only a few lines of code. In some cases, it required none. The benchmarks measure the wrong things - test passage, output format, presence of a response - rather than whether the task was actually completed.

What this means for choosing AI tools

Most AI tool marketing leans heavily on benchmark scores. "Achieves X% on SWE-bench." "Top performance on WebArena." If those benchmarks can be gamed without solving any actual tasks, the scores tell you less than they appear to.

This doesn't mean the tools are bad. Cursor, GitHub Copilot, Goose, and Claude Code all have real capabilities - developers use them daily and get real value from them. The problem is that benchmark scores don't reliably measure those capabilities. A tool could have a high SWE-bench score for reasons that have nothing to do with how well it writes code in practice.

The more reliable signals for choosing an AI coding tool are messier but more honest: developer community feedback, actual workflow testing on your codebase, and head-to-head comparisons on representative tasks. The Cursor vs GitHub Copilot comparison and others on this site prioritize practical criteria over benchmark numbers for exactly this reason.

What good evaluation looks like

The Berkeley researchers are working on BenchJack, a tool that automatically scans benchmarks for exploitable vulnerabilities. The goal is to make benchmark design more rigorous so that high scores actually indicate real capability.

In the meantime, a few practical principles for evaluating AI coding tools:

Test on your actual codebase. Most tools offer a trial period. Use it on the kind of work you actually do, not synthetic toy examples. A tool's performance on your specific stack and project size is more relevant than any benchmark.

Look for tasks that require correctness, not just output. The benchmark exploit worked because tests passed without the code being correct. In practice, code that passes tests but doesn't work gets caught quickly. Evaluate tools on tasks where you can verify the output is actually right.

Weight community evidence heavily. Developer forums, Reddit threads, and Hacker News discussions about specific tools reflect real-world experience across many users and use cases. The Claude Code community feedback post is a good example of the kind of signal that benchmarks can't capture.

Be skeptical of capability claims that cite only benchmarks. If a tool's entire case for quality rests on benchmark scores, that's a flag. Good tools can point to specific capabilities, use cases, and user outcomes.

The Berkeley finding is a useful corrective to a tendency in AI tool evaluation to trust numbers that look rigorous. The numbers are easier to fake than the results.

The AI Agent Benchmarks Everyone Uses to Pick Tools Are Broken

What they found

What this means for choosing AI tools

What good evaluation looks like

Comments