Is Claude Opus 4.6 Getting a Secret 'Nerf' Update? BridgeBench's Viral Callout Gets Absolutely Demolished by Statistics

BridgeMind AI's viral claim that Anthropic's Claude Opus 4.6 was secretly degraded has hit a wall of criticism for questionable methodology. In what can only be described as the most dramatic episode of "my token just dumped and I'm looking for someone to blame," the team behind the BridgeBench coding benchmark posted that Claude Opus 4.6 fell from second to tenth place on its hallucination leaderboard. Accuracy reportedly dropped from 83.3% to 68.3%. Someone get these people a statistician and maybe a hug.

"CLAUDE OPUS 4.6 IS NERFED. BridgeBench just proved it. Last week Claude Opus 4.6 ranked #2 on the Hallucination benchmark with an accuracy of 83.3%. Today Claude Opus 4.6 was retested and it fell to #10 on the leaderboard with an accuracy of only 68.3%," they wrote. Ah yes, nothing says rigorous scientific methodology like one tweet with all caps and zero context. The algorithm wasn't updated. The benchmark just got slightly hungover and forgot how numbers work.

Critics quickly cried foul. Computer scientist Paul Calcraft was blunt: "Incredibly bad science. You tested Opus on 30 tasks today, previous score was on just 6 tasks. Results for 6 tasks in common: 85.4% score today vs. 87.6% previously. Swing is mostly from a single fabrication without repeats – easily statistical noise." Imagine judging a DeFi protocol's TVL based on one hour of data during a gas war. That's basically what happened here, except it was a coding benchmark instead of a liquidity pool having a bad day.

The original high score came from just six benchmark tasks. The new retest expanded to 30. On the six overlapping tasks, performance dropped only from 87.6% to 85.4%—nearly identical. We're talking a swing of about 2.2 percentage points here, folks. That's not a nerf. That's variance. That's the difference between your buddy saying "the food was mid" and "the food was fire" on two different visits to the same taco truck.

That small swing came from one extra fabrication in a single task. With no repeated runs to average out run-to-run randomness, a shift that size falls within normal statistical variance.
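For a rough sense of how little six tasks can tell you, here is a minimal sketch that puts a 95% Wilson score interval around the original score. It leans on a loud simplifying assumption that does not come from BridgeBench: that the 83.3% figure can be modeled as 5 passes out of 6 independent pass/fail tasks. The real benchmark appears to use a finer-grained accuracy score, so treat this as an illustration of small-sample uncertainty, not a reconstruction of their scoring. The function name `wilson_interval` is ours.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Approximate Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# ASSUMPTION (not from BridgeBench): model 83.3% on six tasks as 5 passes out of 6.
low, high = wilson_interval(5, 6)
print(f"95% interval for 5/6: {low:.1%} to {high:.1%}")

# Does the "nerfed" retest score of 68.3% fall inside that interval?
print("68.3% inside interval:", low <= 0.683 <= high)
```

Under that assumption the interval is enormous, on the order of 44% to 97%, so the later 68.3% sits comfortably inside it. In other words, a six-task score is too noisy to support a headline about a secret downgrade, which is the point the critics were making.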
