FrontierNews.ai

GPT 5.5 with OpenAI's Codex Framework Wins Real-World AI Agent Test, Beating Claude Despite Lower Benchmark Scores

GPT 5.5 paired with OpenAI's Codex framework has emerged as the top performer in a new real-world AI agent benchmark, achieving a 24% pass rate on complex professional tasks such as 3D modeling and special-effects work. It narrowly edged out its closest competitors despite lower scores on traditional AI benchmarks. The finding challenges the assumption that the highest test scores automatically translate into the best practical performance.

What Is the New "Agents' Last Exam" Benchmark?

UC Berkeley researchers released a benchmark called Agents' Last Exam (ALE) that tests AI agents on actual professional work rather than multiple-choice questions. The benchmark includes tasks from 55 different industries, including quantitative trading, aerospace engineering, architectural design, and animation special effects. Each task comes from real projects already completed by domain experts, covering work that typically takes humans anywhere from a few hours to several weeks to finish.

The test format differs fundamentally from previous AI benchmarks. Instead of answering written questions, agents receive full computer access through a framework called GCUA (Generalist Computer-Use Agent), allowing them to use mouse clicks, keyboard input, scripts, and web browsing to complete tasks. Tasks include building 3D models in Siemens NX, creating game scenes in Unreal Engine, and performing special-effects compositing in Adobe After Effects.

More than 300 domain experts from over 100 institutions contributed to creating the test questions, including researchers from MIT, Harvard, Stanford, and Oxford, as well as professionals from Goldman Sachs, JPMorgan, Meta, Amazon, and Adobe. The benchmark keeps approximately 90% of its questions secret and rotates them regularly to prevent models from simply memorizing answers.

Why Did GPT 5.5 with Codex Outperform Claude Fable 5?

The results surprised many observers in the AI community. GPT 5.5 paired with OpenAI's Codex framework achieved the highest pass rate at 24.0%, followed by GPT 5.5 with the ALE Claw baseline framework at 23.0%. Claude Fable 5, Anthropic's newest model released just weeks before the benchmark, came in third place with a 22.0% pass rate when paired with Claude Code.

The performance gap becomes more striking when examining cost and efficiency metrics. The Codex framework completed all tasks for just $566, while Claude Fable 5 spent $2,315 to achieve a lower score, meaning Fable 5 cost roughly 4.1 times as much for a pass rate two percentage points lower. This efficiency advantage highlights how framework choice and model pairing significantly affect real-world agent performance.
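The cost comparison above can be checked with a few lines of arithmetic. The sketch below uses only the dollar totals and pass rates reported in this article, and it assumes the $566 figure belongs to the GPT 5.5 plus Codex run that scored 24.0%. The `RunResult` class and the "cost per pass-rate point" metric are illustrative choices, not part of the benchmark. The article does not give the number of tasks, so cost per passed task cannot be derived here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One benchmark configuration (hypothetical container for illustration)."""
    name: str
    total_cost_usd: float
    pass_rate_pct: float

    @property
    def cost_per_point(self) -> float:
        # Dollars spent per percentage point of pass rate.
        return self.total_cost_usd / self.pass_rate_pct


# Figures as reported in the article; the pairing of $566 with the
# 24.0% Codex run is an assumption.
codex = RunResult("GPT 5.5 + Codex", 566.0, 24.0)
fable = RunResult("Claude Fable 5 + Claude Code", 2315.0, 22.0)

cost_ratio = fable.total_cost_usd / codex.total_cost_usd

for run in (codex, fable):
    print(f"{run.name}: ${run.cost_per_point:.2f} per pass-rate point")
print(f"Cost ratio (Fable 5 / Codex): {cost_ratio:.2f}x")
```

Normalizing by pass rate widens the gap slightly beyond the raw cost ratio, because the more expensive run also scored lower.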

Time requirements also varied dramatically across configurations. Some completed the benchmark in under 100 hours, while others required multiple weeks even though they spent substantially more money and achieved lower scores. This variation suggests that the combination of model and framework can matter more than raw model capability alone.

Why Do Traditional Benchmarks Miss Real-World Performance?

The disconnect between traditional benchmark scores and real-world performance reveals a critical gap in how the AI industry evaluates progress. Claude Fable 5 had dominated previous benchmarks, scoring 80.3% on SWE-Bench Pro compared to GPT 5.5's 58.6%, and 64.5% on Humanity's Last Exam versus 52.2%. Yet in the Agents' Last Exam, the rankings reversed.

The highest overall score on ALE was only 45.8%, and even the champion achieved just a 24% pass rate on the full test. At the most difficult level, both Claude Fable 5 and GPT 5.5 scored zero, along with most other mainstream configurations. This suggests that even the strongest AI agents remain far from matching human expert performance, which theoretically reaches 100% on these already-completed projects.

The benchmark's design is intended to prevent the scoring manipulation that has plagued previous evaluations. All submissions are scored automatically by deterministic code with no human judgment involved, and the rotating secret question pool keeps models from achieving high scores through memorization alone. With those safeguards in place, the hardest tasks remain largely unsolved: at the most difficult level, the average pass rate across all mainstream configurations was only 2.6%.

How to Evaluate AI Agent Frameworks for Your Use Case

  • Test on Real Tasks: Evaluate frameworks using actual work from your industry rather than relying solely on published benchmark scores, since real-world performance often differs significantly from test results.
  • Calculate Total Cost of Ownership: Compare not just accuracy but also the total expense per task completion, including model costs, API calls, and computational overhead, as efficiency varies dramatically between frameworks.
  • Measure Time to Completion: Assess how long agents take to finish tasks, since some frameworks complete work in hours while others require days or weeks for identical assignments.
  • Verify Reproducibility: Ensure the framework produces consistent, deterministic results that can be verified automatically rather than relying on subjective human evaluation.
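The checklist above can be sketched as a small evaluation harness. This is a minimal illustration, not the ALE scoring code: the `Agent` interface (a function that takes a prompt and returns an output plus its cost in dollars), the `Task` class, and `toy_agent` are all hypothetical stand-ins for whatever framework and tasks you are testing. Each task carries a deterministic checker, and each task is run more than once so inconsistent verdicts show up.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: an agent maps a prompt to (output, cost in USD).
Agent = Callable[[str], Tuple[str, float]]


@dataclass(frozen=True)
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # deterministic pass/fail checker


def evaluate(agent: Agent, tasks: List[Task], repeats: int = 2) -> Dict[str, float]:
    """Run every task `repeats` times and summarize the checklist metrics."""
    if not tasks:
        raise ValueError("at least one task is required")
    passed = 0
    consistent = 0
    total_cost = 0.0
    start = time.perf_counter()
    for task in tasks:
        verdicts = []
        for _ in range(repeats):
            output, cost = agent(task.prompt)
            total_cost += cost
            verdicts.append(task.check(output))
        passed += all(verdicts)                # pass only if every run passes
        consistent += len(set(verdicts)) == 1  # same verdict on every run
    elapsed = time.perf_counter() - start
    return {
        "pass_rate": passed / len(tasks),
        "total_cost_usd": total_cost,
        "cost_per_pass_usd": total_cost / passed if passed else float("inf"),
        "elapsed_seconds": elapsed,
        "consistency": consistent / len(tasks),
    }


def toy_agent(prompt: str) -> Tuple[str, float]:
    # Stand-in for a real agent call; it only "solves" prompts like "add 2 3".
    if prompt.startswith("add"):
        a, b = prompt.split()[1:3]
        return str(int(a) + int(b)), 0.02
    return "", 0.05


tasks = [
    Task("sum", "add 2 3", lambda out: out == "5"),
    Task("report", "write quarterly report", lambda out: "revenue" in out),
]
report = evaluate(toy_agent, tasks)
print(report)
```

In practice the toy tasks would be replaced with real work samples from your own field, and the agent function would wrap the framework under test and report its actual API spend.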

What Does This Mean for AI Agent Development?

The Agents' Last Exam results suggest that the AI industry's focus on model size and benchmark scores may be misaligned with practical needs. OpenAI researchers celebrated the Codex framework's performance, while many observers who previously predicted that AI agents would replace human workers in 2026 or 2027 have become noticeably quieter about those timelines.

"There are predictions everywhere that AI agents will surpass humans in almost all jobs between 2026 and 2027. So we created this exam to verify this claim," explained Yiyou Sun, core author of the benchmark.

The benchmark's creators explicitly designed ALE to test whether AI agents could match human performance in real professional work. The results indicate substantial gaps remain, even for the most advanced models available today.

Meanwhile, the broader AI industry continues expanding agent capabilities. Coinbase recently launched AI tools allowing agents to access user accounts for trading and payments through both Model Context Protocol (MCP) integration and command-line interfaces. The company plans to introduce configurable rules covering maximum trade size, permitted assets, and spending caps, with integration into x402, an agentic payments protocol, enabling agents to pay for data and services.

"While most platforms allow agents to trade, Coinbase enables both trading and payments, positioning the company as infrastructure for the agentic economy rather than simply another brokerage with an added bot," noted Lincoln Murr, Head of AI Product at Coinbase.

This expansion reflects growing investment from financial institutions preparing for agent-initiated transactions at scale, with Visa, Mastercard, and other payment processors developing similar agentic-commerce initiatives.

The gap between benchmark performance and real-world capability suggests that future AI agent development will likely focus less on raw model intelligence and more on framework design, task-specific optimization, and cost efficiency. The Agents' Last Exam demonstrates that a good test-taker doesn't necessarily make a good doer, a principle that applies equally in the AI world.