OpenAI's o-Series Models Struggle in Finance Benchmarks, Despite Premium Pricing
OpenAI's o-series reasoning models, despite their premium pricing, are significantly underperforming cheaper competitors on complex financial reasoning tasks. A comprehensive benchmark of over 40 large language models (LLMs) tested on 238 challenging financial questions reveals that o1-pro costs $381 per run while achieving only 80.67% accuracy, and o3-pro costs $39.23 for 78.15% accuracy. Both models fall well below the performance of Claude Fable 5, which reaches 90.34% accuracy at just $10.05 per run.
Why Are OpenAI's Reasoning Models Underperforming in Finance?
The FinanceReasoning benchmark, which evaluates models on complex multi-step quantitative reasoning involving financial concepts and formulas, exposes a critical gap between OpenAI's o-series positioning and real-world performance in specialized domains. The o1 model achieved 74.79% accuracy at a cost of $46.59, placing it below even budget-tier models like gpt-oss-120b, which reached 81.09% accuracy for just $0.06, a price difference of more than 700x. This mismatch between cost and performance suggests that the strengths of reasoning models optimized for general problem-solving may not carry over to domain-specific financial analysis.
The benchmark tested models on tasks requiring sophisticated financial analysis, including statement analysis, forecasting, and ratio calculations. These tasks demand iterative algorithmic logic, such as calculating exponential moving averages and average true range indicators for technical analysis. The results indicate that while OpenAI's reasoning architecture excels in certain domains, it may not be the optimal choice for financial institutions evaluating AI tools for quantitative work.
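To illustrate the kind of iterative calculation involved, here is a minimal Python sketch of the two indicators named above. It is not taken from the benchmark. The function names, the choice to seed the EMA with the first price, and the use of a simple average of true ranges for ATR (rather than Wilder's smoothing) are all assumptions made for illustration.

```python
def ema(prices, period):
    """Exponential moving average, seeded with the first price (an assumption)."""
    if period < 1:
        raise ValueError("period must be >= 1")
    if not prices:
        return []
    alpha = 2 / (period + 1)  # conventional smoothing factor
    out = [prices[0]]
    for price in prices[1:]:
        # Each value depends on the previous one, so the steps cannot be skipped.
        out.append(alpha * price + (1 - alpha) * out[-1])
    return out


def true_ranges(highs, lows, closes):
    """True range per bar, starting at the second bar (a prior close is needed)."""
    ranges = []
    for i in range(1, len(closes)):
        prev_close = closes[i - 1]
        ranges.append(max(
            highs[i] - lows[i],
            abs(highs[i] - prev_close),
            abs(lows[i] - prev_close),
        ))
    return ranges


def atr(highs, lows, closes, period):
    """Average true range as a simple mean of the last `period` true ranges."""
    ranges = true_ranges(highs, lows, closes)
    if len(ranges) < period:
        raise ValueError("not enough bars for the requested period")
    return sum(ranges[-period:]) / period
```

Both functions carry state from one bar to the next, which is what makes questions of this type multi-step: an arithmetic slip early in the series propagates to every later value.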
How to Evaluate AI Models for Financial Tasks
- Accuracy on Domain-Specific Benchmarks: Test models on specialized financial reasoning benchmarks rather than general knowledge tests, as performance can vary dramatically across domains. The FinanceReasoning benchmark specifically targets multi-step quantitative problems that mirror real financial analysis work.
- Total Cost Per Task: Calculate both input and output token costs, since pricing structures vary significantly between providers. A model with lower per-token rates may deliver better value than a premium-priced alternative, even if accuracy is slightly lower.
- Token Efficiency: Compare output token consumption alongside accuracy. Some models achieve similar results with vastly different token counts; for example, gemini-3-flash-preview uses 118,530 tokens to reach 83.61% accuracy, while kimi-k2.5 consumes 877,868 tokens for 82.77% accuracy, so the former needs about 7.4 times fewer tokens for a slightly better score.
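The cost and token criteria above reduce to simple arithmetic. The following Python sketch uses only figures quoted in this article. The helper names and the "cost per correct answer" metric are assumptions introduced here for illustration, not measures the benchmark itself reports.

```python
QUESTIONS = 238  # benchmark size quoted in the article


def cost_per_correct(run_cost_usd, accuracy_pct, questions=QUESTIONS):
    """Dollars spent per correctly answered question in one benchmark run."""
    correct = accuracy_pct / 100 * questions
    return run_cost_usd / correct


def token_ratio(tokens_a, tokens_b):
    """How many times more tokens run A consumed than run B."""
    return tokens_a / tokens_b


# (cost per run in USD, accuracy %) as quoted in the article
runs = {
    "o1-pro": (381.00, 80.67),
    "o3-pro": (39.23, 78.15),
    "Claude Fable 5": (10.05, 90.34),
    "gpt-oss-120b": (0.06, 81.09),
}

for name, (cost, acc) in runs.items():
    print(f"{name}: ${cost_per_correct(cost, acc):.4f} per correct answer")

# kimi-k2.5 versus gemini-3-flash-preview output tokens
print(round(token_ratio(877_868, 118_530), 1))  # prints 7.4
```

Normalizing by correct answers rather than by run puts accuracy and price on a single scale, which makes it harder for a high headline accuracy to hide a steep bill.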
Which Models Actually Lead in Financial Reasoning?
Claude Fable 5 emerged as the clear leader, becoming the first model to surpass 90% accuracy on the benchmark at 90.34%. Claude Opus 4.8 follows closely with 89.08% accuracy while consuming only 113,434 tokens, making it the cheapest option to clear the 88% accuracy threshold at $3.28 per run. GPT-5 (dated 2025-08-07) achieved 88.23% accuracy but needed 829,720 tokens to get there, more than seven times as many as Claude Opus 4.8, which drives up its cost.
The benchmark shows that higher token consumption does not buy higher accuracy. DeepSeek R1, which consumed the most tokens of any model tested at 1,251,064, achieved only 62.18% accuracy. Meanwhile, Claude Opus 4 (dated 2025-05-14) scored 80.25% accuracy with just 132,274 tokens, roughly a tenth of DeepSeek R1's total, suggesting that how efficiently a model reasons matters more than how much text it generates.
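The relationship between tokens and accuracy can be checked roughly against the six models whose token counts are quoted in this article. The Python sketch below computes a Pearson correlation coefficient by hand. The sample is small and hand-picked from the article rather than the full benchmark table, so the result is indicative only.

```python
import math

# (output tokens, accuracy %) for the models whose token counts the article quotes
SAMPLE = {
    "gemini-3-flash-preview": (118_530, 83.61),
    "kimi-k2.5": (877_868, 82.77),
    "Claude Opus 4.8": (113_434, 89.08),
    "GPT-5": (829_720, 88.23),
    "DeepSeek R1": (1_251_064, 62.18),
    "Claude Opus 4": (132_274, 80.25),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


tokens = [t for t, _ in SAMPLE.values()]
accuracy = [a for _, a in SAMPLE.values()]
r = pearson(tokens, accuracy)
print(f"Pearson r = {r:.2f}")
```

On these six data points the coefficient is negative, driven largely by DeepSeek R1, which is consistent with the observation that spending more tokens did not produce better answers.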
For organizations with tighter budgets, the results suggest viable alternatives. gpt-oss-120b reached 81.09% accuracy for $0.06 total spend, placing it within two points of 83% accuracy at a small fraction of what frontier models cost: under 1% of Claude Fable 5's $10.05 and about 2% of Claude Opus 4.8's $3.28. Llama 4 Maverick achieved 75.21% accuracy for $0.10, offering a practical option for workloads where accuracy around 75% is sufficient, although gpt-oss-120b scored higher at lower cost on this benchmark.
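Choosing a model under a budget or an accuracy floor can be framed as a small selection problem. The Python sketch below covers only the models whose cost and accuracy are both quoted in this article, so it is a subset of the 40+ models tested. The function names are hypothetical helpers, not part of any benchmark tooling.

```python
# (cost per run in USD, accuracy %) as quoted in the article
RESULTS = {
    "o1-pro": (381.00, 80.67),
    "o1": (46.59, 74.79),
    "o3-pro": (39.23, 78.15),
    "Claude Fable 5": (10.05, 90.34),
    "Claude Opus 4.8": (3.28, 89.08),
    "Llama 4 Maverick": (0.10, 75.21),
    "gpt-oss-120b": (0.06, 81.09),
}


def cheapest_at_least(results, min_accuracy):
    """Name of the cheapest model meeting the accuracy floor, or None."""
    eligible = [(cost, name) for name, (cost, acc) in results.items()
                if acc >= min_accuracy]
    return min(eligible)[1] if eligible else None


def pareto_frontier(results):
    """Models not beaten on both cost and accuracy, listed cheapest first."""
    frontier, best_acc = [], float("-inf")
    ordered = sorted(results.items(), key=lambda kv: (kv[1][0], -kv[1][1]))
    for name, (_cost, acc) in ordered:
        # Keep a model only if it beats every cheaper model on accuracy.
        if acc > best_acc:
            frontier.append(name)
            best_acc = acc
    return frontier


print(cheapest_at_least(RESULTS, 88))  # Claude Opus 4.8
print(pareto_frontier(RESULTS))
```

Within this subset, none of the o-series models lands on the frontier: each one is beaten on both price and accuracy by at least one cheaper model.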
What Does This Mean for OpenAI's Market Position?
The benchmark results suggest that OpenAI's o-series models may be optimized for general reasoning tasks rather than specialized financial analysis. While reasoning models are designed to "think longer" before answering, this approach appears to add cost without proportional accuracy gains in the financial domain. Organizations evaluating AI tools for quantitative finance may find better value in Claude's latest models or even open-source alternatives, depending on their accuracy requirements and budget constraints.
The findings also highlight a broader trend in AI development: specialized benchmarks reveal performance gaps that general-purpose evaluations often mask. As enterprises increasingly deploy AI for domain-specific work, the gap between marketing claims and real-world performance in specialized tasks becomes more consequential. For financial institutions, this benchmark provides concrete evidence that premium pricing does not guarantee superior performance on the tasks that matter most to their operations.