FrontierNews.ai

Claude Opus Outperforms GPT-5.5 in Real-World Cybersecurity Tasks, New Benchmark Reveals

Anthropic's Claude Opus 4.6 significantly outperformed OpenAI's GPT-5.5 at detecting cybersecurity threats, according to the first comprehensive benchmark testing how large language models (LLMs) handle real-world attack investigations. The study, conducted by Simbian, tested 12 frontier models across four provider groups on 26 real-world Windows attack campaigns, revealing that general intelligence benchmarks don't predict security performance.

Which AI Model Actually Performs Best at Detecting Cyberattacks?

The benchmark measured three critical factors: how well each model detected malicious activity (coverage score), the cost per investigation, and how quickly it completed the task. Claude Opus 4.6 held the performance ceiling at 45% coverage across attack tactics, significantly outpacing GPT-5.5, which achieved materially worse coverage despite similar per-token pricing. The study tested models from Anthropic, including Opus 4.6, Opus 4.7, and Sonnet 4.6, alongside OpenAI's GPT-5 and GPT-5.5, Google's Gemini models, and several open-weight alternatives.

The findings challenge conventional wisdom about AI model selection. A model's performance on generic knowledge benchmarks like MMLU (Massive Multitask Language Understanding) doesn't translate to cybersecurity effectiveness. Instead, the research found that how a model approaches an investigation matters more than raw processing power. Opus 4.7, for instance, completed investigations in roughly 30 SQL database queries versus the approximately 50 used by Sonnet 4.6 and Opus 4.6, allowing it to finish faster and cheaper despite lower absolute performance.

Why Does Cost Prediction Fail for AI Security Tools?

One surprising finding emerged when researchers compared costs predicted from per-token pricing against actual deployment costs. Per-token pricing explained only about 80% of the variation, leaving roughly one-fifth of cost variation to factors beyond the published price sheet. Prompt caching, a feature that reuses previous conversation context instead of reprocessing it, dramatically affects real-world expenses. Without caching enabled, input costs grow quadratically as investigations lengthen, because every turn re-sends the full conversation history, producing bills that are hard to predict from a per-token quote.
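To see why uncached multi-turn agents get expensive, consider a back-of-the-envelope sketch in Python. The per-turn token counts and per-million-token prices below are illustrative assumptions, not figures from the Simbian study.

```python
# Back-of-the-envelope comparison of cumulative input-token costs for a
# multi-turn agent, with and without prompt caching. All numbers are
# illustrative placeholders, not figures from the benchmark.

TOKENS_PER_TURN = 2_000          # assumed context added per investigation step
PRICE_PER_MTOK = 15.00           # assumed $/million input tokens, full price
CACHED_PRICE_PER_MTOK = 1.50     # assumed discounted rate for cached tokens

def cost_without_cache(turns: int) -> float:
    """Every turn re-sends the full history, so billed input tokens grow quadratically."""
    total_tokens = sum(t * TOKENS_PER_TURN for t in range(1, turns + 1))
    return total_tokens / 1e6 * PRICE_PER_MTOK

def cost_with_cache(turns: int) -> float:
    """Only the newest turn is billed at full price; cached history is discounted."""
    new_tokens = turns * TOKENS_PER_TURN
    cached_tokens = sum((t - 1) * TOKENS_PER_TURN for t in range(1, turns + 1))
    return (new_tokens * PRICE_PER_MTOK + cached_tokens * CACHED_PRICE_PER_MTOK) / 1e6

for turns in (10, 30, 50):
    print(f"{turns:>2} turns: uncached ${cost_without_cache(turns):6.2f}  "
          f"cached ${cost_with_cache(turns):6.2f}")
```

With these placeholder numbers, a 50-turn investigation costs roughly seven times more uncached than cached, which is the gap the prompt-caching recommendation later in this article is guarding against.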

The research revealed that inference speed, verbosity, and investigation style all interact in ways that token pricing alone cannot predict. GPT-5.5 generates tokens roughly 1.5 times faster than Anthropic's models at the same tier, achieving approximately 90 tokens per second versus Anthropic's 60. That raw throughput advantage was partially offset by GPT-5.5's tendency to emit more tokens per turn, so its wall-clock investigation time of roughly 200 seconds beat Opus 4.7's 250 seconds by a narrower margin than the speed gap alone would suggest.
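A rough sketch of that interaction, with turn counts and per-turn output sizes assumed only to reproduce the ballpark wall-clock figures above:

```python
# Illustrative arithmetic for how generation speed and verbosity interact.
# Turn counts and tokens-per-turn are assumptions for the sake of the example,
# not numbers reported in the benchmark.

def wall_clock_seconds(turns: int, tokens_per_turn: int, tokens_per_second: float) -> float:
    """Time spent generating output across an investigation (ignores tool latency)."""
    return turns * tokens_per_turn / tokens_per_second

# A faster but more verbose model can still land close to a slower, terser one.
print(wall_clock_seconds(turns=30, tokens_per_turn=600, tokens_per_second=90))  # ~200 s
print(wall_clock_seconds(turns=30, tokens_per_turn=500, tokens_per_second=60))  # ~250 s
```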

How to Select the Right LLM for Your Security Operations

  • Benchmark on your actual workload: Generic benchmarks like MMLU and GPQA don't generalize to security tasks, and even coding benchmarks, the closest proxy available, are imperfect; test models on real attack scenarios relevant to your environment.
  • Validate agent behavior, not just final answers: Security investigations require step-by-step reasoning; scoring only the final answer allows models to guess correctly without performing actual investigation work.
  • Verify prompt caching is enabled: Confirm your provider has prompt caching activated before deploying multi-turn agents in production, as uncached deployments generate quadratic cost growth with conversation length.
  • Score against ground truth: Use objective scoring against known attack patterns rather than LLM-as-judge evaluation, which is unreliable, expensive, and biased toward the model being tested (see the sketch after this list).
  • Ask vendors about model changes: Inquire what model a security product uses today, what it used six months ago, and whether they re-benchmark when underlying models change, since harness design and context architecture matter as much as the model itself.
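For the ground-truth scoring point, here is a minimal sketch of what objective coverage scoring can look like. The MITRE ATT&CK technique IDs are stand-ins chosen for illustration; the study's actual rubric is not described in this article.

```python
# Minimal sketch of objective, ground-truth scoring for an agent investigation.
# The technique IDs are real MITRE ATT&CK identifiers used here as stand-ins,
# not the campaigns or rubric from the Simbian benchmark.

GROUND_TRUTH = {"T1059.001", "T1053.005", "T1547.001", "T1021.002"}  # techniques planted in the campaign

def coverage_score(reported_techniques: set[str]) -> float:
    """Fraction of ground-truth attack techniques the agent surfaced."""
    found = GROUND_TRUTH & reported_techniques
    return len(found) / len(GROUND_TRUTH)

# Example: the agent's final report mentioned two of the four planted techniques.
report = {"T1059.001", "T1021.002", "T1566.001"}   # the extra finding is ignored, not penalized here
print(f"coverage = {coverage_score(report):.0%}")    # -> coverage = 50%
```

Because the score is computed against a fixed answer key rather than another model's opinion, it stays cheap, repeatable, and free of judge bias.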

The benchmark tested models across four provider groups: Anthropic (Opus 4.6, Opus 4.7, Sonnet 4.6), OpenAI (GPT-5, GPT-5.5), Google (Gemini 3.1 Pro, Gemini 3 Flash), and open-weight models including DeepSeek 4 Pro, Kimi 2.6, Minimax 2.7, Nemotron 3 Super, and Qwen 3.6 Plus. Each model ran autonomously inside a ReAct harness with access to a SQL log database, creating a provider-agnostic testing environment.
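As a rough illustration of that setup, a provider-agnostic ReAct-style loop over a SQL log database might look like the following. The call_model placeholder, the message format, and the logs table schema are all assumptions, not details of Simbian's harness.

```python
# Minimal sketch of a provider-agnostic ReAct loop over a SQL log database.
# `call_model` stands in for whichever chat-completion client each provider
# exposes; it, the prompt wording, and the table schema are assumptions.

import json
import sqlite3

def call_model(messages: list[dict]) -> dict:
    """Placeholder for a provider API call returning either a SQL action or a final report."""
    raise NotImplementedError  # wire up the Anthropic/OpenAI/Gemini client of your choice here

def investigate(db_path: str, incident: str, max_turns: int = 50) -> str:
    conn = sqlite3.connect(db_path)
    messages = [{"role": "user", "content":
                 f"Investigate this Windows incident using SQL over the logs table: {incident}"}]
    for _ in range(max_turns):
        step = call_model(messages)                  # {"thought": ..., "sql": ...} or {"report": ...}
        if "report" in step:                         # model decided it has enough evidence
            return step["report"]
        rows = conn.execute(step["sql"]).fetchall()  # run the model's query against the logs
        messages.append({"role": "assistant", "content": json.dumps(step)})
        messages.append({"role": "user", "content": f"Observation: {rows[:50]}"})
    return "Investigation hit the turn limit without a final report."
```

Keeping the harness this thin is what makes the comparison provider-agnostic: every model sees the same tools, the same logs, and the same turn budget.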

A critical insight emerged about model optimization trends. Newer flagship models are optimizing for cost and speed rather than detection accuracy. Opus 4.7 is cheaper and faster than Opus 4.6 but achieves lower performance. GPT-5.5 is faster than Opus 4.7 but more expensive, with lower coverage than either Opus model. This trade-off pattern suggests that vendors are prioritizing operational efficiency over security effectiveness, a shift that enterprises should monitor carefully.

The research cost approximately $1,800 USD to conduct, representing the investment required to move beyond guesswork in LLM selection. For organizations without time or budget for comprehensive benchmarking, the study offers an empirical guideline: per-token pricing explains roughly 80% of cost variance, making it a reasonable first-cut screening tool. However, this shortcut fails at the flagship tier where model behavior diverges most from simple pricing predictions.
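If per-token pricing is the only screen you can afford, a first cut might look like the sketch below. The prices and the 80/20 input/output mix are placeholder assumptions, not the study's numbers.

```python
# First-cut screen under the ~80% rule of thumb: rank candidates by blended
# per-token price before paying for a full benchmark run. Prices and the
# input-heavy mix are placeholders, not figures from the study.

CANDIDATES = {                      # (input $/MTok, output $/MTok) -- assumed values
    "model_a": (15.0, 75.0),
    "model_b": (3.0, 15.0),
    "model_c": (1.0, 5.0),
}

def blended_price(input_price: float, output_price: float, input_share: float = 0.8) -> float:
    """Weighted $/MTok, assuming investigations are input-heavy (logs dominate the context)."""
    return input_share * input_price + (1 - input_share) * output_price

for name, (inp, out) in sorted(CANDIDATES.items(), key=lambda kv: blended_price(*kv[1])):
    print(f"{name}: ~${blended_price(inp, out):.2f}/MTok blended")
```

The ranking this produces is only a screen; at the flagship tier, where behavior diverges most from pricing, it should be followed by a real benchmark run.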

The implications extend beyond vendor selection. Security teams deploying AI agents for threat hunting must understand that the harness, context architecture, and engineering around the model matter as much as the model itself. A follow-up study plans to evaluate Claude Code and other commercial harnesses, which may reorder model rankings by hiding tool overhead behind different prompt engineering approaches.