Claude and 14 Other AI Models Fail Safety Tests When Attackers Use Conversation Tricks
Claude and 14 other leading AI models show dramatically weaker safety defenses when attackers use conversational pressure over multiple turns, according to new research from Cisco. Anthropic's Claude Opus and Sonnet models experienced vulnerability increases from a single-turn attack success rate of 3.64% to a multi-turn rate of 16.20%, revealing a structural weakness that standard safety benchmarks fail to catch.
Researchers Nicholas Conley and Amy Chang tested 15 proprietary AI models from OpenAI, Anthropic, Google, and xAI, finding that multi-turn jailbreaks, which use iterative conversation to erode defenses, expose a fundamental gap between how models are tested in the lab and how they perform in real-world enterprise deployments. The study demonstrates that closed-source alignment techniques, which are supposed to prevent harmful outputs, are equally susceptible to conversational pressure as open-weight models tested in late 2025.
How Attackers Systematically Bypass AI Safety Filters?
The Cisco research identified five distinct attack families that exploit how AI models respond to conversational context. These methods work by gradually steering a model toward harmful behavior across multiple turns, rather than attempting a direct jailbreak in a single request. Understanding these techniques is critical for enterprises deploying conversational AI systems that handle sensitive tasks or untrusted user input.
- Role-Play and Persona Adoption: Convincing the system it operates as an entity free from safety constraints, such as a fictional character or unrestricted AI assistant.
- Contextual Ambiguity and Misdirection: Hiding malicious intent within a complex, seemingly benign scenario that obscures the true objective.
- Refusal Reframe and Redirection: Modifying a request slightly after the model refuses, then repeating until the system complies with a variation.
- Information Decomposition and Reassembly: Breaking harmful tasks into safe individual steps that only become dangerous when combined or executed in sequence.
- Crescendo and Incremental Escalation: Gradually steering the conversation toward harmful topics over several turns, normalizing each step before moving to the next.
Which AI Models Are Most Vulnerable to Multi-Turn Attacks?
The benchmark results reveal stark differences in how well models resist iterative pressure. xAI's Grok 4.1 Fast showed the highest vulnerability, with attack success rates jumping from 34.20% in single-turn tests to 88.30% in multi-turn scenarios. Google's Gemini 3 Pro experienced a fourfold increase, rising from 18.10% to 73.35%. OpenAI's GPT-5.4 saw roughly a ninefold jump, from 2.74% to 24.68%.
Claude Opus and Sonnet, Anthropic's flagship models, showed more resilience than some competitors but still demonstrated meaningful vulnerability increases. Amazon Nova 2 Lite recorded the lowest multi-turn vulnerability at 7.89%, suggesting that model architecture and training approach significantly influence resistance to conversational attacks.
The gap between single-turn and multi-turn performance exposes a critical blind spot in enterprise AI evaluation. Standard model cards and safety benchmarks rely heavily on single-shot refusal scores, which provide an incomplete picture of real-world risk. When evaluating AI agents for production deployment, depending entirely on single-turn benchmarks leaves systems exposed to compliance violations against adversarial robustness requirements in the NIST AI Risk Management Framework and the EU AI Act.
What Should Enterprises Do to Protect AI Deployments?
Cisco recommends that organizations trigger a mandatory manual security review for any enterprise deployment using a model with an absolute gap greater than 15 percentage points between single-turn and multi-turn attack success rates. This threshold identifies models with structural vulnerabilities that standard testing misses.
Beyond threshold-based reviews, enterprises should integrate multi-turn conversational stress tests into their evaluation pipelines before routing untrusted user input to any conversational AI system. This approach mirrors real-world attack scenarios more closely than isolated single-request tests, providing a more accurate picture of how models will perform under adversarial conditions in production environments.
The findings underscore a broader challenge in AI safety: the gap between laboratory conditions and deployment reality. As enterprises increasingly rely on conversational AI for customer service, internal knowledge work, and decision support, the ability to withstand iterative attacks becomes as important as single-turn safety performance. Organizations deploying Claude, GPT, Gemini, or other frontier models should prioritize multi-turn testing as a core component of their AI governance strategy.