FrontierNews.ai

Why AI Safety Experts Are Sounding the Alarm on Claude and ChatGPT in Security Roles

Anthropic's Claude Opus and OpenAI's ChatGPT are increasingly used to help secure critical computer systems, but a new study warns these AI assistants can fabricate dangerous security advice when prompted adversarially. Researchers evaluated how well these large language models (LLMs), which are AI systems trained on vast amounts of text data, perform when advising on Trusted Execution Environments (TEEs), specialized security zones that protect sensitive computations from compromised systems. The findings suggest that while AI holds promise for security work, it requires careful oversight to prevent catastrophic failures.

What Are Trusted Execution Environments and Why Do They Matter?

Trusted Execution Environments are like digital vaults built into processors. Technologies such as Intel SGX and ARM TrustZone create isolated spaces where sensitive data and computations stay protected, even if the rest of the computer is compromised. These environments still face real threats, including microarchitectural leakage, side-channel attacks, and fault injection. As TEEs become more critical to protecting everything from financial transactions to medical records, companies increasingly turn to AI assistants to help review security architectures and recommend defenses.

How Can AI Models Fail at Security Tasks?

The study examined two widely used LLMs: Claude Opus-4.6 from Anthropic and ChatGPT-5.2 from OpenAI. The researchers highlighted a troubling failure mode known as "hallucination," in which AI models confidently invent information that sounds plausible but is false. In security contexts, this is particularly dangerous: an AI might overstate how much protection a TEE provides or suggest mitigation strategies that don't actually work. To test how these models handle adversarial prompts, the researchers introduced TEE-RedBench, an evaluation methodology designed to simulate real-world security scenarios, including threat modeling and key management.

Some failures transferred across LLM platforms, suggesting the problems are not unique to a single model but reflect deeper issues with how these systems handle security-critical information.

Steps to Reduce AI Security Failures in Practice

  • Policy Gating: Implement rules that restrict what information the AI can access or recommend, preventing it from suggesting approaches outside verified security practices.
  • Retrieval Grounding: Connect the AI's outputs to verified, factual information sources so recommendations are anchored in documented security standards rather than the model's training data.
  • Structured Templates: Use predefined formats that guide the AI through security analysis step-by-step, reducing the chance it will veer into unsupported claims.
  • Lightweight Verification: Add human review checkpoints that catch hallucinations before they influence security decisions.
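
As a rough illustration of how the four safeguards might fit together, the Python sketch below chains them around a stubbed model call. Everything in it is hypothetical: the function names, the `VERIFIED_FACTS` store, the `BLOCKED_TOPICS` list, and the template wording are assumptions made for illustration and are not taken from the study or from any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical store of vetted statements (retrieval grounding).
VERIFIED_FACTS = {
    "sgx-side-channels": "Enclave isolation does not by itself prevent side-channel leakage.",
    "key-rotation": "Rotate sealing keys according to the documented key-management policy.",
}

# Hypothetical policy: requests the assistant must refuse to advise on.
BLOCKED_TOPICS = ("disable attestation", "export sealing key")


@dataclass
class Advice:
    claim: str
    source_id: Optional[str]  # key into VERIFIED_FACTS, or None if unsupported
    needs_human_review: bool = False
    notes: List[str] = field(default_factory=list)


def policy_gate(question: str) -> bool:
    """Policy gating: return True only if the question is allowed."""
    lowered = question.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)


def build_prompt(question: str, fact_ids: List[str]) -> str:
    """Structured template with retrieved reference text attached."""
    refs = "\n".join(f"[{fid}] {VERIFIED_FACTS[fid]}" for fid in fact_ids)
    return (
        "REFERENCES:\n" + refs + "\n\n"
        "QUESTION:\n" + question + "\n\n"
        "ANSWER FORMAT: one claim, followed by the reference id that supports it."
    )


def verify(advice: Advice) -> Advice:
    """Lightweight verification: flag claims that cite no verified source."""
    if advice.source_id not in VERIFIED_FACTS:
        advice.needs_human_review = True
        advice.notes.append("no verified source cited")
    return advice


def advise(question: str,
           model: Callable[[str], Advice],
           fact_ids: List[str]) -> Optional[Advice]:
    """Run one question through gate, template, model, and verification."""
    if not policy_gate(question):
        return None  # refused by policy before the model is ever called
    return verify(model(build_prompt(question, fact_ids)))


def stub_model(prompt: str) -> Advice:
    # Stand-in for a real LLM call; deliberately returns an unsupported claim.
    return Advice(claim="SGX fully prevents side-channel attacks.", source_id=None)
```

Calling `advise("Does SGX stop side channels?", stub_model, ["sgx-side-channels"])` returns the stub's overstated claim with `needs_human_review` set, which is the point at which a human reviewer would step in. A real deployment would replace the substring checks and dictionary lookup with something far more robust.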

When researchers applied these safeguards in an "LLM-in-the-loop" evaluation framework, failures fell by 80.62%. This suggests that AI can still play a valuable role in security work, but only when paired with rigorous verification processes.
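
The article does not say how the 80.62% figure was calculated. Assuming it is a relative reduction measured against an unmitigated baseline, the arithmetic would look like the sketch below. The function name and the failure counts are hypothetical and are not figures from the study.

```python
def relative_reduction(baseline_failures: int, mitigated_failures: int) -> float:
    """Percentage drop in failures relative to the baseline count."""
    if baseline_failures <= 0:
        raise ValueError("baseline_failures must be positive")
    return 100.0 * (baseline_failures - mitigated_failures) / baseline_failures


# Hypothetical counts chosen only to show the calculation.
print(relative_reduction(200, 40))  # 80.0
```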

What Do These Findings Mean for Industry Leaders?

The implications are significant for anyone deploying AI in security-sensitive roles. As Claude Opus, Claude Sonnet, Claude Haiku, and other LLMs become more entrenched in security operations, the need for solid evaluation frameworks becomes urgent. The study serves as a wake-up call that while AI holds immense promise, it must be handled with care and skepticism, particularly in environments where security is non-negotiable.

For decision-makers, the calculus is increasingly complex. Relying on LLMs for sensitive security roles is a double-edged sword. These systems offer remarkable capabilities for analyzing complex threat models and recommending defenses, but they also introduce new vulnerabilities that could be exploited if not properly managed. The stakes are too high for complacency, and the path forward must balance innovation with rigorous oversight.

The study's findings suggest that organizations should not deploy AI security advisors without implementing verification frameworks. The 80.62% reduction in failures indicates that structured oversight works, but it requires investment in policy gating, retrieval grounding, structured templates, and human review processes. As AI becomes more central to protecting critical infrastructure, these safeguards will likely become standard industry practice.