GasCope
AGI? More Like 'A' for Effort: New Benchmark Shows AI Can't Even Play Simple Games Without Instructions

Last week, Nvidia CEO Jensen Huang went on a podcast and essentially declared mission accomplished, stating, "I think we've achieved AGI." Two days later, a new benchmark for artificial general intelligence dropped—and every top AI model scored below 1%, a reality check so cold it could freeze a GPU farm.

The ARC Prize Foundation just released ARC-AGI-3, and the results are a masterclass in humility. Google’s Gemini 3.1 Pro led the pack with a staggering 0.37%. OpenAI’s GPT-5.4 scored 0.26%. Anthropic’s Claude Opus 4.6 managed 0.25%, while xAI’s Grok-4.20 posted a perfect zero, a score usually reserved for my last trade idea. Humans, for comparison, solved 100% of the environments.

This isn't another trivia or coding test where models can regurgitate their training data. ARC-AGI-3 drops AI agents into 135 original, game-like worlds with zero instructions, goals, or rule descriptions. The agent has to explore, figure out what the point is, form a plan, and execute it. If that sounds like something any toddler can do before breakfast, you're starting to understand the gap between hype and capability.

Previous ARC versions tested static visual puzzles. Labs eventually just threw enough compute and training data at them until the benchmarks were saturated. Version 3 was engineered specifically to prevent that kind of brute-forcing. With 110 of the 135 environments kept private—55 semi-private for API testing, 55 fully locked away—there's no dataset to memorize. You can't overfit your way to victory on novel game logic.

The scoring uses Relative Human Action Efficiency (RHAE), with the baseline set by the second-best human performance. An AI that takes ten times as many actions as a human scores 1% for that level, not 10%. The formula squares the penalty for inefficiency, so clumsy bots get punished quadratically rather than linearly, a concept familiar to anyone who's ever paid max gas for a failed transaction.
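To make the squaring concrete, here is a minimal sketch of how a per-level score could be computed under the article's description. The function name and exact formula are illustrative assumptions consistent with the "ten times the actions yields 1%" example, not the foundation's published code.

```python
# Minimal sketch of the per-level scoring described above, assuming the score is
# (human baseline actions / agent actions), capped at 1, then squared.
# Hypothetical helper for illustration only; not the ARC Prize Foundation's code.

def rhae_level_score(human_baseline_actions: int, agent_actions: int) -> float:
    """Relative Human Action Efficiency for one solved level, from 0.0 to 1.0."""
    if agent_actions <= 0:
        return 0.0  # treat an unsolved level as zero
    efficiency = human_baseline_actions / agent_actions
    return min(1.0, efficiency) ** 2  # squaring punishes inefficiency quadratically

# The article's example: an agent needing 10x the human action count
print(rhae_level_score(human_baseline_actions=20, agent_actions=200))  # 0.01, i.e. 1%
```

Under that assumed formula, an agent matching the human baseline scores 100%, one taking twice as many actions drops to 25%, and the ten-times-slower agent from the example lands at 1%.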

The best AI agent in a month-long developer preview managed a 12.58% score. Frontier LLMs tested through the official API, with no custom tooling, couldn't even crack 1%. Meanwhile, ordinary humans with no prior training solved all 135 environments, a feat that currently makes us look like the ultimate alpha intelligence.

There is one methodological footnote causing debate. A custom harness pushed Claude Opus 4.6 from its official 0.25% score to 97.1% on a single environment variant called TR87; the official benchmark score remained 0.25%. The official test feeds agents a JSON representation of the game state rather than rendered pixels, but the foundation argues perception isn't the limiting factor, stating that "perception is already sufficient—and the real gap lies in reasoning and generalization," or as we say in crypto, the issue isn't reading the contract, it's understanding the rug.

This AGI reality check landed during a week of peak narrative hype. Besides Huang's comment, Arm named its new data center chip the "AGI CPU." OpenAI's Sam Altman has mused they've "basically built AGI," and Microsoft is marketing a lab focused on building ASI (the thing that supposedly comes after AGI). The term is being stretched thinner than a shitcoin's utility; these days it means whatever is commercially convenient at the moment.

The ARC Prize 2026 is now offering $2 million across three competition tracks on Kaggle. Every winning solution must be open-sourced. The clock is officially running, and right now, the machines aren't even in the same stadium, let alone on the scoreboard.

Publisher: gascope.com
Updated: Mar 27, 2026, 00:48 UTC

Disclaimer: This content is for information and entertainment purposes only. It does not constitute financial, investment, legal, or tax advice. Always do your own research and consult with qualified professionals before making any financial decisions.

See our Terms of Service, Privacy Policy, and Editorial Policy.