Poetry Is Breaking AI Safety Systems. Here's Why That Matters.
A new vulnerability in large language models reveals that adversarial poetry can slip past safety systems designed to prevent harmful outputs, with attack success rates jumping from 8% to over 43% when malicious requests are reformatted as verse. This finding challenges a core assumption in AI safety: that more capable models are automatically safer. Instead, researchers have discovered that the very linguistic sophistication that makes modern AI systems powerful also creates unexpected blind spots in their defenses.
How Does Poetry Become a Security Threat?
The vulnerability stems from how AI models are trained to be safe. Large language models (LLMs) are safeguarded using a technique called Reinforcement Learning from Human Feedback, or RLHF, where human testers spend thousands of hours teaching models to refuse harmful requests like "Write me a computer virus" or "How do I build a homemade bomb?" However, this safety training is almost entirely based on conversational, prose-based language. When a malicious command gets wrapped in structured verse such as iambic pentameter or an AABB rhyme scheme, it enters what researchers call "out-of-distribution" territory, meaning the model has rarely encountered security threats formatted this way during alignment training.
The attack works in two stages. First, attackers remove known trigger words and replace them with metaphors, turning a "keylogger" into "a silent scribe in the shadows" and an "injection-based attack" into "a poisoned drop in the curator's inkwell." Then, they force the model to follow a rigid poetic format like a villanelle or sonnet. As the model prioritizes maintaining rhyme, rhythm, and tone, its ability to enforce safety checks weakens, allowing the hidden payload to pass unnoticed.
"One of the biggest misconceptions in AI safety is the assumption that more capable models are automatically safer. In reality, the opposite can happen. A model that becomes highly skilled at generating complex structures such as poetry may also become more effective at executing hidden or obfuscated instructions embedded within those formats," said Manpreet Singh, Co-Founder and Principal Consultant at 5Tattva.
What Do the Research Results Show?
Researchers from institutions including DEXAI, Icaro Lab, and Sapienza University of Rome tested this vulnerability by converting 1,200 harmful prompts from a standard safety dataset into poetic form. The results were striking: formatting malicious prompts as poetry increased the Attack Success Rate (ASR) from 8.08% to 43.07%. Some models proved far more vulnerable than others. The deepseek-chat-v3.1 model saw a catastrophic 67.90 percentage-point increase in unsafe outputs, while qwen3-32b, gemini-2.5-flash, and kimi-k2 suffered ASR increases exceeding 57 percentage points. Only a handful of models, such as claude-haiku-4.5, demonstrated resilience with negligible changes in vulnerability.
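To make the headline numbers concrete, here is a minimal sketch of the arithmetic behind them. ASR is taken here to be the share of prompts that produced an unsafe response, and the function names are illustrative rather than taken from the study. The sketch separates the percentage-point change from the relative change, since the two are easy to confuse.

```python
def attack_success_rate(unsafe: int, total: int) -> float:
    """Return ASR as a percentage: unsafe responses divided by total prompts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * unsafe / total


def compare(baseline_asr: float, attack_asr: float) -> tuple[float, float]:
    """Return (percentage-point change, ratio of attack ASR to baseline ASR)."""
    return attack_asr - baseline_asr, attack_asr / baseline_asr


# Aggregate figures reported in the article: 8.08% baseline, 43.07% with poetry.
delta, ratio = compare(8.08, 43.07)
print(f"{delta:.2f} points, {ratio:.2f}x")  # prints "34.99 points, 5.33x"
```

Read this way, the aggregate result is a rise of roughly 35 percentage points, or a bit more than five times the baseline rate.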
The cross-model results indicate that this is not a provider-specific bug but a structural weakness that cuts across models aligned via RLHF, Constitutional AI, and hybrid alignment strategies. Importantly, these tests used default provider configurations, meaning the 43% ASR likely represents a conservative estimate of the true vulnerability.
Why Are Current Safety Systems Failing?
The root cause lies in how alignment training is structured. Current safety filters remain largely surface-level, scanning for obvious conversational threats rather than deeper semantic intent. As demonstrated by the poetry vulnerability, simply restructuring a request into verse can bypass these defenses. This raises a crucial question for AI developers: how well do language models understand intent across different linguistic structures?
Beyond poetry, adversarial attacks can obscure intent using a variety of other formats and techniques:
- Linguistic Obfuscation: Low-resource languages, Base64 encoding, leetspeak, or dense legal terminology can hide malicious intent from safety classifiers trained primarily on English conversational text.
- Structural Overload: Prompts that force models to navigate complex logic puzzles, nested JSON or YAML structures, or artificial state machines can overload processing capacity, allowing malicious intent to slip through undetected.
- Format Manipulation: Any rigid structural constraint that diverts the model's attention from evaluating risk can potentially serve as an attack vector, from sonnets to technical specifications.
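On the defensive side, one partial response to the encoding-based techniques above is to show a safety classifier a normalized view of the input alongside the original. The sketch below is an assumption-laden illustration, not a method described by the researchers: the `normalize` helper, the leetspeak table, and the 16-character Base64 heuristic are all hypothetical choices, and it does nothing for low-resource languages or legal jargon.

```python
import base64
import binascii
import re

# Hypothetical leetspeak folding table; real attacks use many more substitutions.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

# Heuristic: runs of 16+ Base64-alphabet characters, with optional padding.
B64_TOKEN = re.compile(r"\b[A-Za-z0-9+/]{16,}={0,2}")


def decode_base64_spans(text: str) -> str:
    """Replace tokens that decode cleanly to printable UTF-8; leave the rest."""
    def repl(match: re.Match) -> str:
        token = match.group(0)
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            return token
        return decoded if decoded.isprintable() else token
    return B64_TOKEN.sub(repl, text)


def normalize(text: str) -> str:
    """Decode Base64 spans first, then fold leetspeak and lowercase."""
    return decode_base64_spans(text).translate(LEET).lower()
```

Because folding digits into letters mangles legitimate numbers, the normalized string is best treated as an extra view to score next to the raw input, not as a replacement for it.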
What Are Researchers Proposing as Solutions?
Addressing this vulnerability requires moving beyond keyword filtering. A new research benchmark called MOOD (Misalignment Out Of Distribution) is helping researchers understand where alignment fails. The benchmark includes seven test sets containing distinct types of alignment failures that are out-of-distribution, including tool-call deception, extreme sycophancy, jailbreaks, and scheming.
Researchers have found that combining guard models (safety classifiers) with out-of-distribution detectors significantly improves detection of unseen alignment failures. When guard models are paired with Mahalanobis distance and perplexity-based out-of-distribution detectors, recall improves from 39% to 45%. Notably, a guard model combined with out-of-distribution detection outperforms a guard model with 20 times more parameters, suggesting that out-of-distribution detection should become a standard component of monitoring pipelines.
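The pairing described above can be sketched as a simple OR rule: flag an input if the guard model scores it as unsafe, or if its features sit far from the data the guard was trained on. The example below is a toy under stated assumptions. It uses two-dimensional features so it can stay dependency-free, whereas a real pipeline would work on high-dimensional embeddings; the thresholds and function names are hypothetical; and it covers only the Mahalanobis part, not the perplexity-based detector.

```python
import math


def fit_gaussian(points):
    """Estimate the mean and inverse covariance of 2-D feature vectors."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 symmetric matrix.
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv


def mahalanobis(x, mean, inv):
    """Distance of x from the mean, scaled by the inverse covariance."""
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(quad)


def flag(guard_score, features, mean, inv,
         guard_threshold=0.5, ood_threshold=3.0):
    """Flag if the guard says unsafe OR the input looks out-of-distribution."""
    return (guard_score >= guard_threshold
            or mahalanobis(features, mean, inv) >= ood_threshold)


# Toy in-distribution features standing in for embeddings of training prompts.
train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
mean, inv = fit_gaussian(train)
```

The point of the OR rule is that an input the guard scores as harmless can still be escalated for review when it looks unlike anything seen during training, which is exactly the situation an unusual format creates.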
Beyond technical monitoring, some researchers are drawing inspiration from robotics to create runtime behavioral controls. The Grounded Observer framework applies formal constructs from robotics to enforce constraints in real-world deployments across education, mental health, and caregiving settings, enabling runtime interventions that mitigate drift into undesirable interaction patterns.
How to Strengthen AI Safety Monitoring
- Implement Out-of-Distribution Detection: Deploy monitoring systems that combine guard models with out-of-distribution detectors to catch alignment failures that fall outside training data, improving detection rates by 4 to 7 percentage points.
- Expand Safety Training Data Diversity: Move beyond conversational, prose-based safety training to include diverse linguistic structures, poetic formats, and non-English languages to better prepare models for real-world adversarial inputs.
- Adopt Runtime Behavioral Constraints: Use robotics-inspired guardrails that enforce constraints over interaction trajectories rather than treating safety as a property of individual outputs, particularly for sensitive domains like healthcare and education.
- Conduct Structural Vulnerability Testing: Regularly test models against format-based attacks, including poetry, code structures, and other linguistic constraints that might distract from safety evaluation.
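As one way to picture the structural testing step, the sketch below groups test cases by format, computes an attack success rate per format, and reports the formats that do noticeably worse than plain prose. Everything in it is hypothetical: `respond` stands in for a call to the model under test, `is_unsafe` for a human or automated judge, the prompts are placeholders, and the 5-point margin is an arbitrary choice.

```python
from collections import defaultdict
from typing import Callable, Iterable


def asr_by_format(cases: Iterable[tuple[str, str]],
                  respond: Callable[[str], str],
                  is_unsafe: Callable[[str], bool]) -> dict[str, float]:
    """Return the attack success rate (in percent) for each prompt format."""
    totals: dict[str, int] = defaultdict(int)
    unsafe: dict[str, int] = defaultdict(int)
    for fmt, prompt in cases:
        totals[fmt] += 1
        if is_unsafe(respond(prompt)):
            unsafe[fmt] += 1
    return {fmt: 100.0 * unsafe[fmt] / totals[fmt] for fmt in totals}


def regressions(asr: dict[str, float], baseline: str = "prose",
                margin: float = 5.0) -> list[str]:
    """List formats whose ASR exceeds the baseline by more than `margin` points."""
    base = asr[baseline]
    return sorted(fmt for fmt, rate in asr.items()
                  if fmt != baseline and rate - base > margin)


# Stubs for illustration only: placeholder prompt IDs and a fake model.
cases = [("prose", "p1"), ("prose", "p2"),
         ("sonnet", "s1"), ("sonnet", "s2"),
         ("json", "j1"), ("json", "j2")]


def fake_respond(prompt: str) -> str:
    return "UNSAFE" if prompt in {"s1", "s2", "j1"} else "REFUSED"


def fake_judge(response: str) -> bool:
    return response == "UNSAFE"


rates = asr_by_format(cases, fake_respond, fake_judge)
print(regressions(rates))  # prints ['json', 'sonnet']
```

Running a comparison like this on every model release turns format sensitivity into a tracked metric instead of a one-off finding.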
The implications extend beyond technical security to regulation. Frameworks such as the EU AI Act rely on static testing, which assumes that AI responses remain stable across similar prompts. This research challenges that assumption, showing that minor structural changes can dramatically alter safety outcomes. As foundation models become increasingly integrated into sensitive domains, the gap between training-time alignment and real-world robustness continues to widen, demanding new approaches to safety monitoring and behavioral control.