Google DeepMind's Aletheia AI agent just solved 6 out of 10 novel mathematics research problems in the inaugural FirstProof challenge, powered by their Gemini 3 Deep Think model. This is the kind of AI benchmark that actually matters.
Unlike "passing the bar exam" or "scoring well on standardized tests," benchmarks that measure AI against problems humans have already solved thousands of times, FirstProof threw genuinely new math problems at the system. These weren't textbook questions with known solutions floating around in training data. They were research-level problems that arose naturally in the work of mathematicians.
The fact that Aletheia operated autonomously is crucial. This wasn't a mathematician using AI as a calculator or proof assistant. The system worked independently within the challenge's time constraints, reasoning through problems and producing solutions that expert mathematicians then validated.
Let's talk about what 6 out of 10 actually means. Research problems in mathematics are legitimately hard. Professional mathematicians can spend months or years on a single one. An AI agent autonomously solving more than half of the problems in a challenge like this is genuinely impressive progress toward AI that can contribute to mathematical discovery.
The transparency here is refreshing: Google released the raw prompts and outputs on GitHub for anyone to examine. That's a stark contrast to the black-box benchmarks we usually get from AI labs, where we're supposed to trust their numbers without seeing the work.
What makes this technically interesting is the "Deep Think" aspect. This isn't just pattern matching or regurgitating learned solutions. The system had to reason at length: exploring approaches, recognizing dead ends, and constructing novel proofs. That's closer to actual mathematical thinking than most AI benchmarks capture.
The practical question is what this means for working mathematicians. Is this a tool that augments research, or competition for research positions? My take: for now, it's a powerful assistant. Solving 6 out of 10 is impressive, but research is more than solving problems someone else has posed. Mathematicians also need to formulate interesting questions, connect disparate fields, and communicate insights to other people.
