Claude Opus 4.6 Is Getting Worse, and Anthropic Isn't Saying Why

Anthropic's flagship Claude Opus 4.6 model has quietly degraded in performance over the past several weeks, according to hundreds of paying customers and independent performance monitoring. Users paying between $20 and $200 monthly report the model failing tasks it previously handled consistently, while third-party benchmarks show measurable declines. The issue mirrors a similar degradation crisis from September 2025 that Anthropic only acknowledged after public pressure.

What Are Users Actually Experiencing With Claude Opus 4.6?

The complaints follow a strikingly consistent pattern across Reddit, X (formerly Twitter), and GitHub issue trackers. Developers describe a model that feels "dimmer," more prone to circular reasoning, and less capable than weeks before. One Claude Code user with an enterprise subscription wrote: "Claude Code for the first time in 2 years did not recognize it had a native Plan Mode. Nor did it know how to activate it. I've been a massive advocate for Claude Code. Brought it to my org, enterprise subscription, been paying for it since I could. Now this? This is garbage".

Another developer ran controlled tests comparing Opus 4.6 to the earlier Opus 4.5 version and found that Opus 4.6 now fails benchmarks it previously passed consistently, while Opus 4.5 still passes them. The user concluded: "Switched back to 4.5 on Claude Code and what a difference. Feels like I got Opus back finally. The untransparent nerfing is absolutely ridiculous and makes me think about canceling my Max plan".

Perhaps most striking are reports from users who loaded identical coding projects and used the exact same prompts that worked the previous week, only to receive dramatically inferior results. One frustrated subscriber paying 110 euros monthly stated: "How is this sh*t even legal??? I'm paying 110€ a month for an AI that at this point is on the level of a support chatbot".

How Are Independent Monitors Tracking the Decline?

Anecdotal complaints alone might be dismissed as user error or unrealistic expectations. But Marginlab, an independent third-party organization with no affiliation to Anthropic or competing AI providers, has been running daily SWE-Bench-Pro benchmark evaluations of Claude Code with Opus 4.6 since the degradation complaints began. Their findings provide concrete numbers.

Marginlab's data establishes a historical baseline pass rate of 56%. As of April 10, 2026, the most recent daily evaluation had slipped to 50%, a drop of six percentage points. While Marginlab notes that a single daily reading is not statistically significant at their threshold, given the small daily sample of 50 test cases, the downward trend is notable.
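
Marginlab's caveat about significance is straightforward to verify. The sketch below is a hypothetical check, not Marginlab's actual methodology; it takes the 56% baseline and a 25-of-50 daily result from the figures above and runs an exact one-sided binomial test asking how likely a 50% reading is if the true pass rate were still 56%.

```python
from math import comb

def binom_p_lower(k: int, n: int, p: float) -> float:
    """Exact one-sided binomial p-value: P(X <= k) when X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

baseline = 0.56   # historical pass rate
n_daily = 50      # test cases per daily run
passed = 25       # 50% of 50

p_value = binom_p_lower(passed, n_daily, baseline)
print(f"P(<= {passed}/{n_daily} passes | true rate {baseline:.0%}) = {p_value:.3f}")
# The p-value lands well above 0.05: one day's 50% reading is still
# consistent with an unchanged 56% baseline, matching Marginlab's caveat.
```

The same arithmetic also shows why the trend matters more than any single day: several consecutive sub-baseline readings would push the combined p-value far below conventional significance thresholds.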

The fact that a third party has set up independent monitoring specifically because they cannot trust Anthropic's own reporting is itself significant. Marginlab explained their methodology with full transparency: "We benchmark in Claude Code CLI with the SOTA model (currently Opus 4.6) directly, no custom harnesses. What you see is what you get." This means their results reflect the actual experience a real user would have, not an idealized lab environment.


Steps to Assess Your Own Claude Performance

  • Run Baseline Tests: Document specific prompts and tasks that Claude handled well previously, then re-run them with identical inputs to compare outputs and identify performance changes.
  • Compare Model Versions: Test the same task with both Opus 4.6 and Opus 4.5 side-by-side to determine if performance differences are consistent across versions.
  • Monitor Context Window Behavior: Track whether Claude's performance degrades earlier than expected within long conversations, particularly noting any self-reported quality drops before reaching the advertised context limit.
  • Check Independent Benchmarks: Review third-party performance dashboards like Marginlab and LMArena to see how Opus 4.6 ranks against competing models on standardized tasks.
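
The first two steps above can be sketched as a small regression harness. Everything here is hypothetical: the task names, the `check` predicates, and the `fake_model` stub stand in for your own prompts and for calls to whatever API or CLI you use. The point is the structure: fixed prompts, deterministic pass/fail checks, and a pass rate you can compare across model versions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    prompt: str                    # identical input every run
    check: Callable[[str], bool]   # deterministic pass/fail on the output

def pass_rate(tasks: list[Task], run_model: Callable[[str], str]) -> float:
    """Run every task through run_model and return the fraction that pass."""
    results = [task.check(run_model(task.prompt)) for task in tasks]
    return sum(results) / len(results)

# Hypothetical tasks; replace the checks with assertions meaningful to your work.
tasks = [
    Task("fizzbuzz", "Write fizzbuzz in Python.", lambda out: "FizzBuzz" in out),
    Task("iso-date", "Regex for an ISO date.", lambda out: r"\d{4}" in out),
]

# Stub standing in for a real model call; swap in your actual client here.
def fake_model(prompt: str) -> str:
    return "def fizzbuzz(): print('FizzBuzz')" if "fizzbuzz" in prompt else "no match"

print(f"pass rate: {pass_rate(tasks, fake_model):.0%}")
```

Because model outputs vary between runs, predicate checks (substring, regex, or executing the generated code) are more robust than exact string comparison; running the same task list against two model versions gives a like-for-like pass-rate comparison.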

Why Does This Pattern Keep Repeating?

This is not the first time Anthropic has faced widespread degradation complaints. Between August and early September 2025, users flooded Reddit and social media with reports of dramatically degraded Claude performance. Subreddits like r/ClaudeCode devolved into daily threads of broken behavior, with users switching to competing services. For weeks, Anthropic said nothing. Only after Sam Altman quote-tweeted a screenshot of the r/ClaudeCode subreddit did an incident post appear.

Anthropic eventually published a detailed engineering postmortem in September 2025, acknowledging three separate infrastructure bugs that had degraded Claude's responses across multiple models over a period of weeks. The bugs included a context window routing error that at peak affected 16% of all Sonnet 4 requests, a TPU misconfiguration that caused output corruption including random Thai and Chinese characters appearing in English responses, and an XLA:TPU compiler miscompilation bug affecting token probability calculations.

In that postmortem, Anthropic stated clearly: "To state it plainly: We never reduce model quality due to demand, time of day, or server load." That's a meaningful statement. Yet as one observer noted in response: "Anthropic says they 'never intentionally degrade model quality.' Maybe. Users don't experience intent; we experience results. Quality dropped. Communication dropped to zero. Only after a public shaming did we get the tidy 'two bugs, resolved' line".


What Evidence Suggests Intentional Degradation?

Several data points raise questions about whether the current decline is accidental or deliberate. One developer documented a session using the 1M-context version of Opus 4.6 in which the model itself reported declining performance, eventually telling the user at just 48% context capacity: "I'm deep enough in this context that I'm not being effective." That is despite an advertised context window of a full million tokens. The model was losing track of decisions, engaging in circular reasoning, and contradicting itself, all before reaching half its supposed context limit.

Additionally, performance monitoring on LMArena shows Claude Opus 4.6 is no longer in the top 10 on the leaderboard for vision tasks, with Gemini 3 Pro now sitting at number one with an Elo rating of 1288. This represents a significant drop in competitive standing among frontier models.
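
For a sense of what an Elo gap on a leaderboard implies, the standard Elo expected-score formula converts a rating difference into a head-to-head win probability. The 1288 figure is Gemini 3 Pro's rating from the leaderboard above; the 1250 comparison rating is an arbitrary illustrative value, not Opus 4.6's actual score.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Probability that a model rated r_a beats one rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# 1288 (Gemini 3 Pro) vs a hypothetical model rated 1250:
print(f"{elo_expected(1288, 1250):.3f}")  # ~0.554: a 38-point gap is a modest edge
```

Equal ratings give exactly 0.5, and each additional 400 points of gap multiplies the odds by 10, which is why even small shifts in leaderboard position can reflect meaningful head-to-head differences.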

The timing and consistency of the complaints, combined with Anthropic's silence and the pattern of previous undisclosed degradations, have raised a fundamental consumer-rights question: when you pay a subscription fee for a specific AI model, do you have any right to actually receive that model in the condition you purchased it? The September 2025 postmortem was technically thorough, but it came weeks after the degradation began and only after significant public pressure. Users had been paying premium subscription rates throughout the entire period of degraded service without any notification or compensation.