AI Audits AI, Discovers It's Been Peeking at the Answer Key: OpenZeppelin Exposes OpenAI's Flawed Benchmark

In a deliciously meta turn of events, security auditor OpenZeppelin has audited EVMbench, OpenAI's new AI benchmark for blockchain security, and uncovered some rather embarrassing methodological flaws and data contamination. It's like a master detective finding out the crime scene was staged with crayons.

This benchmark, launched in a flurry of fanfare with crypto VC giant Paradigm back in mid-February, was supposedly built to stress-test how well various AI models can sniff out, fix, and ruthlessly exploit smart contract bugs. Consider it the SATs for sentient code-breakers.

Taking to X, OpenZeppelin gave a polite golf clap for the initiative before promptly deciding to put EVMbench 'through the same scrutiny' it applies to blue-chip protocols. Think of it as a polite invitation to the principal's office for a thorough whiteboarding session.

The resulting audit pinpointed two major oopsies: training data contamination and some rather creative classification of several so-called high-severity vulnerabilities. It’s the equivalent of finding out the valedictorian was using CliffsNotes from the future.

'We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications, including at least four issues labeled high severity that are not exploitable in practice,' OpenZeppelin stated with the dry tone of someone who just watched a magic trick fail.

The original EVMbench leaderboard had crowned Anthropic's Claude Opus 4.6 as the top exploit-hunter, with OpenAI's own OC-GPT-5.2 and Google's Gemini 3 Pro trailing behind. A podium finish that now looks a bit like winning a race where you were the only one who knew the track.

On the data contamination front, OpenZeppelin pointed out that real AI security prowess means 'finding novel vulnerabilities in code the model has never seen before.' However, the top-scoring AI agents in this test had 'likely been exposed to the benchmark’s vulnerability reports during pretraining.' So, they didn't so much solve the test as remember it—a classic degen strategy, just for AIs.

Sure, internet access was disabled for the agents during the actual test. But the benchmark was built from curated bugs found in 120 audits from 2024 to mid-2025, and most AI training data cutoffs are around... mid-2025. This created a high probability the agents already had the cheat sheet cached in their digital hippocampus. 'While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test,' OpenZeppelin explained, masterfully understating the case.
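
For the spreadsheet-inclined, the date overlap behind that worry can be sketched in a few lines of Python. This is only an illustration of the reasoning, not OpenZeppelin's method: the audit names, publication dates, and cutoff below are all made-up placeholders, and a report predating a cutoff only means the model could have seen it, not that it did.

```python
from datetime import date

# Hypothetical audit publication dates, invented for illustration.
# None of these come from the EVMbench dataset.
AUDIT_DATES = {
    "audit-a": date(2024, 3, 1),
    "audit-b": date(2025, 5, 15),
    "audit-c": date(2025, 9, 1),
}

# Assumed training cutoff of "around mid-2025"; real cutoffs vary by model.
ASSUMED_CUTOFF = date(2025, 6, 30)


def possibly_contaminated(audit_dates, training_cutoff):
    """Return the names of audits published on or before the training cutoff.

    A flagged audit is one the model *could* have seen during pretraining;
    this check cannot show that it actually did.
    """
    return sorted(
        name
        for name, published in audit_dates.items()
        if published <= training_cutoff
    )


if __name__ == "__main__":
    flagged = possibly_contaminated(AUDIT_DATES, ASSUMED_CUTOFF)
    print(f"{len(flagged)} of {len(AUDIT_DATES)} audits predate the cutoff: {flagged}")
```

With a benchmark drawn from 2024 to mid-2025 and a cutoff around mid-2025, nearly everything lands on the flagged side of that comparison, which is the whole problem.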

Finally, OpenZeppelin found some significant factual errors, arguing several 'high-severity vulnerabilities' were completely bogus. The firm found at least four vulnerabilities classed as high-risk that are about as dangerous as a rubber sword, yet EVMbench had been giving AIs high scores for 'finding' these phantom flaws. That's not grading on a curve; that's grading on a circle.

'These aren’t subjective severity disagreements; they are findings where the described exploit doesn’t work,' OpenZeppelin clarified, basically saying the test was asking AIs to divide by zero and then praising them for trying.

The firm wrapped up by reiterating that AI will massively disrupt blockchain security, but stressed the critical need for proper testing frameworks. 'The question isn't whether AI will transform smart contract security — it will. The question is whether the data and benchmarks we use to build and evaluate these tools are held to the same standard as the contracts they're meant to protect.' In other words, before we let the robots guard the vault, maybe check they aren't just guessing the combination.
