There’s an old cartoon that’s a favourite of mine: Two researchers stand at a blackboard, surrounded by dense equations. In the middle of it all, one has scribbled, “Then a miracle happens.” The other points and says, “I think you should be more explicit here.”
It’s a funny take on a real issue — the temptation to skip over the hardest step with a hand wave and hope no one notices.
Today, large language models (LLMs) like ChatGPT are making similar leaps. They can start a solution, cite the right formulas, sometimes even finish with a correct-looking answer. But in the middle — where actual understanding, creativity, or deep reasoning should happen — things get fuzzy. That miracle moment still eludes them.
A recent experiment called First Proof, run by a group of top mathematicians, puts this to the test in a practical way. Their aim: see how AI performs not on textbook math, but on unpublished, research-level problems — problems where the answer isn’t already out there and where genuine insight is needed.
Here’s why their findings matter to anyone working in research-driven fields, advanced analytics, or AI strategy.
What First Proof Tried to Do
The team — including Fields Medalist Martin Hairer and professors from Stanford, Harvard, and MathSci.ai — created a testing ground for AI that reflects how math is really done. Each mathematician submitted a problem from their own active research, along with a known solution that was held back until the results were collected.
The goal wasn’t to trick the models — it was to find out what they can really do without help.
They tested advanced systems such as GPT-5.2 Pro and Google’s Gemini 3.0. The problems were designed to be understandable to the models (no diagrams, no need for endless responses), but they still required original thought.
What They Found
1. The “miracle” moment is still missing.
LLMs could generate lots of words — often correct ones — but when it came to the real leap, the key insight or new connection, they tended to gloss over it. As Hairer put it, “They know where they’re starting, and where they want to end, but not how to get there. So they just write ‘therefore’ and move on.”
2. They’re better at mimicking than reasoning.
The models do well at reassembling known arguments, even stitching together steps from different examples. But when asked to apply that logic in a truly unfamiliar setting, they stumble or repeat themselves.
3. Without human oversight, the risk of nonsense grows.
Several tests showed the models giving wrong answers with full confidence, or going in circles second-guessing themselves. As Williams noted, they can sound persuasive even when they’re off-track — which can be more dangerous than silence.
What This Means for Business and Research Leaders
For companies investing in AI to enhance R&D, engineering, finance, or biotech workflows, the takeaway is practical:
- AI is a support tool, not a replacement for domain expertise. It can accelerate some tasks, but it still needs humans to guide and verify critical thinking.
- Model output isn’t self-validating. Just because an answer is long, well-written, or plausible doesn’t make it correct. In high-stakes areas — like pricing models, scientific design, or legal analysis — that distinction matters.
- The path to automation still requires judgment. Especially in areas that involve uncertainty, abstraction, or novel problem-solving, LLMs aren’t ready to operate solo.
A Useful Reality Check
The First Proof project isn’t anti-AI. In fact, the researchers welcome better tools — they just want to know what’s hype and what’s real. Their hope is to create a more honest benchmark: one that helps AI evolve by confronting its limits.
It’s a helpful reminder for all of us using these systems: they can be powerful, fast, even impressive — but they’re not magicians. And when you see the phrase “and then a miracle happens,” that’s your cue to slow down and ask better questions.
Because in math, as in business, you can’t afford to build on steps you don’t understand.