AI Still Can't Do the Math: Models Faceplant on Reading Charts, Remain Dangerously Unprepared for DeFi Spreadsheets

Forget about achieving Artificial General Intelligence—today's top AI models still get their math absolutely bodied the moment it's presented in a simple chart. A crew of researchers from Microsoft Research, Sahara AI, and Emory University put 12 foundation models, including the usual suspects like ChatGPT, Gemini, and Claude, through the wringer on the MATHVISTA benchmark. This test checks if they can actually reason mathematically using visual intel like graphs and diagrams, not just regurgitate text.

The highest score, courtesy of GPT-4 Vision, was a frankly pathetic 49.9%. For comparison, the average human participant managed 60.3%, a difference of 10.4 percentage points. That's not a narrow gap; it's a chasm wide enough to drive a truck full of failed trading bots through, and it highlights just how far current AI is from the flexible reasoning we associate with true AGI.

“We want the machine to do things that a normal, average person can do for their daily tasks,” explained Microsoft's Principal Researcher Hao Cheng. The project essentially asks whether these models can look at a chart and solve a multi-step problem. That skill takes more than fancy autocomplete; it requires actually understanding the world, a concept still foreign to most silicon brains.

So what's the core malfunction? It turns out many older evaluation datasets were full of problems a clever model could solve using text patterns alone, completely sidestepping the need for any visual reasoning. “Which is not ideal,” Cheng dryly noted, in what might be the understatement of the year in AI research.

AGI itself remains a hilariously fuzzy finish line. Tech executives confidently predict it, venture capitalists throw billions at it, and doomers warn about its existential risks, while the actual researchers can't even agree on what "it" is or when it might show up to the party.

The MATHVISTA benchmark, which launched in October 2023, has already been downloaded over 275,000 times. Building it wasn't a weekend project; it required annotators who were actually skilled in arithmetic, algebra, geometry, and statistics to craft problems that test deep reasoning, not just the ability to spot a pie chart.

For this endeavor, Microsoft teamed up with Sahara AI, which supplied the trained human annotators and rigorous quality checks to produce a dataset of over 6,000 multimodal examples. Think of them as the meticulous auditors checking the AI's homework before it gets graded.

Benchmarking these models is a minefield due to “data contamination,” as Sahara AI CEO Sean Ren points out. If the benchmark's answers are already lurking in a model's training data, a high score might just mean it's good at memorization, not that it has suddenly learned to think—a critical difference when you're trusting it with anything more important than generating memecoins.
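
For the curious, here is roughly what a contamination check can look like in practice. This is a minimal sketch, not the method Sahara AI or the MATHVISTA team used: it assumes contamination can be approximated by word-level n-gram overlap between a benchmark question and a training corpus, and the function names, the n-gram size, and the 0.5 threshold are all illustrative choices.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(benchmark_items, training_docs, n=8, threshold=0.5):
    """Flag benchmark items whose n-grams largely appear in the training docs.

    Hypothetical helper: n and threshold are assumptions, not published values.
    """
    # Collect every n-gram seen anywhere in the training corpus.
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)

    flagged = []
    for item in benchmark_items:
        grams = ngrams(item, n)
        if not grams:
            continue  # item is shorter than n words, so nothing to compare
        overlap = len(grams & train_ngrams) / len(grams)
        if overlap >= threshold:
            flagged.append(item)
    return flagged


# Toy data, invented purely for illustration.
training_docs = [
    "the bar chart shows sales of 40 units in march and 55 units in april"
]
benchmark_items = [
    "the bar chart shows sales of 40 units in march",
    "what is the slope of the line through the two marked points",
]
flagged = flag_contaminated(benchmark_items, training_docs, n=4)
```

In this toy run, only the first question is flagged, because every one of its four-word sequences already appears in the training text. A model acing that question proves it has a good memory and not much else.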

Elon Musk, never one for modest predictions, recently claimed a 10% probability of achieving AGI with xAI's upcoming Grok 5 model. He argued the key is live data from his platform X, not static training sets, calling this real-time access his primary "competitive edge." It's a bold claim, considering the current state of the art can't reliably read a bar graph.

Researchers also highlight the fundamental limits of training data. “You definitely need to have some way to inject some of the new knowledge into this process,” Cheng said, suggesting that a flood of high-quality, novel data is needed to break the current "knowledge boundary" that models keep hitting like a brick wall.

One theoretical path forward involves building simulated environments where models can interact and learn from experience, like a video game for AIs. “You create a twin world... so the model can play... and basically break the boundary of the internet,” Cheng explained. So, we're training
