AI Is One Year Away From Passing 'Humanity's Last Exam,' Google DeepMind Says

Artificial intelligence is rapidly closing the gap on a benchmark test specifically designed to measure expertise that only the world's brightest minds can achieve. Researchers believe AI could reach near-perfect scores on "Humanity's Last Exam" (HLE) within the next year, marking a significant milestone in the race toward artificial general intelligence (AGI), a theoretical AI system that matches human-level reasoning across all domains.

The HLE comprises 2,500 questions spanning over 100 highly specialized fields, including mythology, rocket science, and ancient languages. More than 1,000 authorities from the sciences, humanities, and arts contributed to the test, which was deliberately designed to require PhD-level comprehension and remain beyond the reach of current AI systems.

How Has AI Performance on This Test Improved So Quickly?

The speed of AI's progress on the HLE has been remarkable. When ChatGPT first attempted the exam in 2024, it answered fewer than 3% of questions correctly. Within months, Google's Gemini model achieved 18.8% accuracy, a roughly sixfold improvement in reasoning performance. This rapid advancement suggests that the gap between AI and human expertise may be narrowing faster than many experts anticipated.

To prevent AI systems from simply memorizing answers, the test creators took extraordinary precautions. They offered a $500,000 prize to experts who could contribute questions that couldn't be easily answered through web searches, eventually receiving over 70,000 responses. Any questions that existing models could answer were discarded until only the most challenging 2,500 remained.

What Makes This Test Different From Other AI Benchmarks?

The HLE stands apart because it measures depth of expertise rather than pattern recognition alone. Test questions might ask AI systems to translate ancient Palmyrene inscriptions or identify microanatomical structures in birds, requiring specialized knowledge that goes beyond what's readily available online. The test creators deliberately kept most answers hidden from public view to prevent future models from memorizing solutions.

  • Specialized Knowledge Requirement: Questions span 100+ fields requiring PhD-level expertise, not just general knowledge
  • Expert-Vetted Content: Over 1,000 authorities from sciences, humanities, and arts contributed questions to ensure rigor and accuracy
  • Anti-Memorization Design: Test creators withheld answers and discarded any questions existing models could answer through web search
  • Rapid AI Progress: Performance improved from under 3% to 18.8% within months, suggesting near-perfect scores may be within reach in a year

Kate Olszewska, a product manager at Google DeepMind, expressed confidence in AI's trajectory. "If we truly cared about this as the only thing in life, I think we could get to it pretty quickly," she stated, suggesting that focused effort could accelerate progress toward mastering the benchmark.

What Do Experts Say About What This Achievement Would Actually Mean?

While the prospect of AI achieving near-perfect scores on the HLE is impressive, some researchers caution against interpreting such performance as proof of human-level understanding. Dr. Tung Nguyen, a computer science and engineering professor at Texas A&M who contributed 73 questions to the exam, noted that the gap between AI and human intelligence remains significant despite recent progress.

"When AI systems start performing extremely well on human benchmarks, it's tempting to think they're approaching human-level understanding. But HLE reminds us that intelligence isn't just about pattern recognition, it's about depth, context and specialized expertise," explained Dr. Tung Nguyen.

Nguyen emphasized that the ultimate goal of the HLE isn't to stump AI, but rather to illuminate both the strengths and weaknesses of these systems. Understanding where AI excels and where it falls short can help researchers build "safer, more reliable technologies" while demonstrating "why human expertise still matters" in an era where AI is increasingly replacing human workers across industries.

Interestingly, recent research has revealed that AI systems may develop reasoning processes that mirror human cognition in unexpected ways. In 2025, tests by Chinese researchers found similarities between how AI models "perceive" information and how human brains process language and concepts. The analysis showed strong alignment between AI model embeddings and neural activity patterns in brain regions associated with memory and scene recognition, suggesting that AI systems might be developing human-like conceptual representations of objects.

As AI continues to advance, the HLE serves as a crucial checkpoint for measuring progress toward artificial general intelligence. Whether AI reaches near-perfect scores on the exam within the next year remains to be seen, but the trajectory is clear: the machines are getting smarter, faster than most observers expected.