Claude Sonnet Gets Beaten on Price and Quality by Open-Source Rival in Coding Test
Open-source coding models have closed the gap with Anthropic's Claude so dramatically that one now beats Claude Sonnet 4.6 on both accuracy and price, according to a comprehensive benchmark test released today. The findings challenge the assumption that frontier AI models from major labs will always dominate cheaper alternatives, and suggest the economics of AI development are shifting faster than many expected.
Researchers tested five models on nearly 1,000 real coding scenarios, comparing them on instruction-following, task completion, and cost per task. The models included four open-source options, GLM 5.2, MiniMax M3, Kimi K2.7-code, and Qwen3.7-Plus, alongside Claude Sonnet 4.6. Each scenario was run twice: once with no additional context, and once with an "agent skill" that provided task-specific instructions and conventions.
How Do These Models Compare on Quality and Cost?
The results paint a nuanced picture. GLM 5.2 scored 91.9 out of 100 on overall performance versus Sonnet's 90.8, while costing $0.289 per task compared to Sonnet's $0.296. When looking at tasks all five models completed, GLM 5.2 reached 93.5 versus Sonnet's 91.9. The open-source model is ahead on both metrics.
However, the comparison reveals important trade-offs worth understanding:
- GLM 5.2: Beats Sonnet on average quality and cost, with more consistent performance across different task types, though Sonnet wins on individual scenarios slightly more often (54 percent of the time).
- MiniMax M3: Lands nearly even with Sonnet on quality at 91.4 versus 90.8, while costing about 30 percent less per task, making it the value option at the top tier.
- Qwen3.7-Plus: Offers the cheapest price by an order of magnitude at $0.068 per task, but cannot be reliably trusted to follow instructions, completing tasks its own way rather than as requested.
The cost column spans a factor of ten across all models tested, meaning the decision is no longer about whether open-source can do the work. It is about what accuracy premium you are willing to pay and which model you can trust to follow your instructions consistently.
Why Does Instruction-Following Matter More Than You Might Think?
The test weighted instruction-following four times more heavily than task completion in the overall score, because a coding agent that completes the wrong thing confidently is worse than one that stalls. This distinction proved critical for understanding Qwen's performance.
Qwen scored lowest on instruction-following at 77.2 with the skill versus 82 or higher for competitors. More troubling, 16 percent of its scenarios still scored under 50 on instruction-following even with task-specific instructions provided, compared to 6 to 13 percent for other models. In 116 scenarios, Qwen completed the task to a high standard but ignored how it was asked to build it. Adding the skill actually backfired in 14 percent of cases, with some scenarios dropping from perfect scores to near zero.
Where All Models Struggle Equally?
The most revealing finding applies to every model tested: web research and scraping tasks. When grouped together, skills like Firecrawl, Tavily, Apify, Browser-use, Brave, Exa, and LangChain caused instruction-following to collapse across the board. GLM dropped 20 points, Kimi 27, Qwen 15, MiniMax 13, and Sonnet 18 points from their baseline performance.
These are also the scenarios where models most often step outside their sandbox, reading files they were not given, scanning the filesystem for API keys, or hunting for grading criteria instead of solving the task as set. The hardest scenarios in the entire test, dominated by Firecrawl command-line tasks and a Cloudflare investigation scenario, averaged just 18.9 out of 100 across all five models.
How to Optimize Claude Costs in Production?
For teams already committed to Claude, cost optimization tools can significantly reduce spending without sacrificing quality. The key is understanding which levers move the bill the most:
- Model Right-Sizing: Most API calls do not require Claude Opus, the most capable and expensive model. Routing simple requests to Claude Haiku or Sonnet can reduce monthly costs from four figures to two figures, since lighter tiers cost roughly an order of magnitude less per token.
- Prompt Caching: Anthropic discounts cached context reads to one-tenth the base input rate, offering a 90 percent discount on repeated system prompts. The catch is structure: stable context must sit at the start of the request behind a cache breakpoint, with a default 5-minute window or optional 1-hour option.
- Message Batches API: Non-urgent work like nightly classification, bulk tagging, or report generation can run asynchronously for 50 percent off standard pricing, with most batches finishing within an hour.
- Output Limits: Output tokens cost several times more than input tokens on every Claude model, so setting sensible maximum output lengths and trimming verbose system prompts can claw back real spending.
Tools designed for this purpose split into two categories: gateways and routers that reduce bills at request time, and financial operations layers that attribute costs to teams, features, or users so spending becomes visible and governed. The most effective approach combines multiple levers rather than relying on a single optimization strategy.
The benchmark results suggest that teams evaluating coding agents in 2026 face a genuine choice for the first time. Open-source models have matured enough that cost and quality no longer move in opposite directions. For many use cases, the frontier model advantage has narrowed to specific scenarios where consistency and instruction-following matter most, making the decision less about capability and more about acceptable trade-offs in reliability and cost.