OpenAI's GPT-5.5 Undercuts Claude on Price While Winning on Coding Tasks

FrontierNews.ai AI Research Desk

OpenAI's GPT-5.5 Undercuts Claude on Price While Winning on Coding Tasks

OpenAI has released GPT-5.5, a new frontier model that costs significantly less than its closest competitor while leading on critical coding benchmarks. The model launched on April 23, 2026, to ChatGPT Plus, Pro, Business, and Enterprise subscribers, with API access following the next day. GPT-5.5 scores 82.7% on Terminal-Bench 2.0, a benchmark measuring autonomous developer workflows, and costs just $5 per million input tokens and $30 per million output tokens. This pricing represents a dramatic shift in the economics of frontier AI models, potentially reshaping how startups and enterprises choose their AI infrastructure.

How Does GPT-5.5 Compare to Claude and Other Leading Models?

The benchmark landscape has fragmented into specialty leaders rather than one dominant model. GPT-5.5 takes the top spot on three major evaluations, but the competition remains fierce across different task categories. Here is how the frontier models stack up on key metrics:

Terminal-Bench 2.0 (Autonomous Coding): GPT-5.5 leads with 82.7%, compared to Claude Opus 4.7 at 69.4% and Gemini 3.1 Pro at 68.5%, a 13.3-point advantage that independent reviewers have verified as significant for long-running shell-based tasks.
SWE-Bench Pro (Real GitHub Issues): Claude Opus 4.7 retains the edge at 64.3% versus GPT-5.5's 58.6%, a 5.7-point gap that matters for enterprises standardizing on a single coding agent for production work.
GDPval (General Task Performance): GPT-5.5 scores 84.9%, outperforming Claude Opus 4.7 at 80.3% and Gemini 3.1 Pro at 67.3%, demonstrating broad capability gains across diverse workloads.
OSWorld-Verified (Computer Use): GPT-5.5 achieves 78.7%, a metric that measures the model's ability to interact with software interfaces autonomously.

The Terminal-Bench lead is the most material finding. Terminal-Bench 2.0 evaluates agent performance on long-running shell-based tasks, including installing dependencies, debugging environment errors, recovering from failed commands, and producing working artifacts. A 13-point lead at this level of saturation is unusual, and several independent reviewers verified the result within 48 hours of release.

Why Is the Pricing Reset the Real Story?

OpenAI's pricing strategy represents a quiet but consequential reset in the frontier model market. GPT-5.5 maintains input token pricing at $5 per million tokens, identical to GPT-5.4, while output tokens cost $30 per million, only modestly higher than the previous generation's $25. This flat pricing despite double-digit benchmark gains is striking because it effectively lowers the cost of completing equivalent work.

The comparison to Anthropic's Claude Opus 4.7 is the one OpenAI clearly designed for. Claude Opus 4.7 costs $15 per million input tokens and $75 per million output tokens, meaning GPT-5.5 lists at one-third the input cost and 40% of the output cost while leading on Terminal-Bench, GDPval, and OSWorld-Verified benchmarks. Early users estimate that token-efficiency improvements of roughly 40% fewer output tokens on coding tasks mean the effective cost of completing equivalent agentic coding work is closer to a 20% increase than the doubling the headline price would suggest.

For startups building coding agents, customer-support automation, or document-processing pipelines that consume hundreds of millions of tokens per week, the unit-economics gap is now large enough to drive switching even where Claude retains a benchmark edge. Three early-stage AI infrastructure startups, Factory, Cognition, and Magic, confirmed they were running GPT-5.5 in production evaluation pipelines by the morning of April 24.

What Are the Key Improvements in GPT-5.5's Capabilities?

OpenAI has explicitly moved away from positioning new releases around academic benchmark scores like MMLU and GPQA, both of which have effectively saturated for frontier models, and toward agentic evaluation suites that measure end-to-end task completion. The launch materials emphasize that GPT-5.5 understands tasks earlier, asks for less clarification, uses tools more efficiently, checks its own work, and "keeps going until it is done".

Safety improvements are quantified in the system card. GPT-5.5's individual factual claims are reportedly 23% more likely to be correct than the comparison baseline, and full responses contain a factual error roughly 3% less often. The model also ships with a more aggressive personalization layer inside ChatGPT, including richer memory integration, persistent access to past chats, and connected services like Gmail, Google Drive, and Microsoft 365 where users opt in.

OpenAI structured the April 23 announcement around three product tiers rather than a single flagship. GPT-5.5 Instant is the default model surfaced inside ChatGPT for free, Plus, and Pro users, a routing-layer model that responds quickly with light reasoning. GPT-5.5 Thinking is the deliberative variant exposed in the model picker for problems that benefit from extended chain-of-thought reasoning. GPT-5.5 Pro, the most expensive tier at $30 input and $180 output per million tokens, is reserved for ChatGPT Pro subscribers and Enterprise tiers, plus an API endpoint targeted at research labs, quant funds, and law firms running long-form analysis pipelines.

Where Does GPT-5.5 Still Fall Short?

The SWE-Bench Pro loss to Claude Opus 4.7 is the launch's most acknowledged weakness. Anthropic's model resolves real GitHub issues end-to-end at 64.3% to OpenAI's 58.6%, a 5.7-point delta that matters for enterprises standardizing on a single coding agent for production software development. Additionally, Gemini 3.1 Pro retains its lead on autonomous web research with a BrowseComp score of 85.9% versus GPT-5.5's 84.4%, meaning Google's model remains the volume play for high-context retrieval workloads.

Google's Gemini 3.1 Pro also remains competitive on price. At $2.50 input and $15 output per million tokens with a 2-million-token context window, it is still the cheapest frontier model for high-context retrieval workloads, and its BrowseComp lead means autonomous web research agents continue to default to it. For teams building web-research-focused agents, the cost and capability combination still favors Google's offering.

What Should Developers and Enterprises Watch Next?

The GPT-5.5 launch signals that the frontier has fragmented into specialty leaders rather than one model dominating every test. Developers and enterprises should evaluate models based on their specific use case rather than assuming a single model will excel across all tasks. For autonomous coding and long-horizon reasoning, GPT-5.5's Terminal-Bench lead and aggressive pricing make it a compelling choice. For real-world GitHub issue resolution and tool coordination, Claude Opus 4.7 retains an edge. For autonomous web research and high-context retrieval, Gemini 3.1 Pro remains the most cost-effective option.

OpenAI has signaled that GPT-6 is expected later in 2026, meaning the model race will continue to accelerate. The pricing reset and benchmark fragmentation suggest that future releases will compete on both raw capability and unit economics, with different models optimized for different workloads rather than a single frontier model serving all use cases.

Your AI & Tech News Engine

Breaking News

Jensen Huang Just Handed AMD a $200 Billion Opportunity. Here's Why Wall Street Is Paying Attention.

Sundar Pichai on Why Google's Search Overhaul Will Reshape the Web,Again

The Real AI Bottleneck Isn't Chips Anymore,It's Power Infrastructure

AI Data Centers Face a Water Crisis: Cornell Study Maps the Environmental Reckoning Ahead

The Musk-Altman AI Showdown: Why Their Trillion-Dollar IPO Race Matters Beyond Silicon Valley

Google's Antigravity Faces a Trust Crisis as Developers Flee to Claude Code and Cursor

Moore's Law Is Dying. Here's What Comes Next, According to Huawei and NVIDIA

Why Grok Is Winning the AI Bias Debate While ChatGPT and Claude Face Scrutiny

OpenAI's GPT-5.5 Undercuts Claude on Price While Winning on Coding Tasks

How Does GPT-5.5 Compare to Claude and Other Leading Models?

Why Is the Pricing Reset the Real Story?

What Are the Key Improvements in GPT-5.5's Capabilities?

Where Does GPT-5.5 Still Fall Short?

What Should Developers and Enterprises Watch Next?