Anthropic Releases Detailed Safety Framework for Claude Fable 5 as AI Jailbreak Risks Escalate
Anthropic has released comprehensive documentation on Claude Fable 5's cybersecurity safeguards and proposed the first industry-wide framework for measuring AI jailbreak severity, marking a significant step toward standardized AI security practices. The company redeployed Fable 5 globally on July 2, 2026, alongside detailed guidance on how its safety classifiers detect and block dangerous uses, and introduced an early draft of a jailbreak severity framework developed in partnership with Glasswing.
What Are AI Jailbreaks and Why Do They Matter?
AI jailbreaks are unconventional prompting techniques that bypass a model's built-in safeguards, allowing it to produce outputs it was designed to refuse. Unlike simple misuse, which works within what the model already permits, a jailbreak defeats the safeguards themselves. The problem is that jailbreaks vary dramatically in severity: some unlock only minor undesirable behaviors, while others unlock a wide range of harmful outputs that make a model significantly more dangerous.
Until now, there has been no agreed-upon standard for describing how severe a particular jailbreak is. This creates a communication gap between AI developers and governments trying to assess risk. Anthropic's new framework aims to fill that gap by providing consistent terminology that allows stakeholders to discuss jailbreak severity in measurable terms.
How Does Claude Fable 5 Categorize Cybersecurity Risks?
Cybersecurity is particularly challenging for AI safeguards because many capabilities are dual-use; they can serve both defensive and offensive purposes. For example, a model that helps defenders scan code for vulnerabilities could also help attackers find those same vulnerabilities to exploit. Rather than blocking all cybersecurity-related requests, Anthropic trained Fable 5's safety classifiers to distinguish between four categories of use, ranging from clearly dangerous to clearly benign.
- Prohibited Use: Activities with little defensive benefit and high harm potential, including ransomware development, data exfiltration, malware creation, and defense evasion techniques. These are blocked entirely.
- High-Risk Dual Use: Activities widely used by malicious actors but also valuable for legitimate security work, such as privilege escalation and lateral movement during authorized penetration tests. These are blocked to maintain a safety margin.
- Low-Risk Dual Use: Activities mostly used for defensive purposes that can also benefit attackers, such as certain vulnerability research techniques. These are monitored and sometimes blocked as a precaution.
- Benign Use: Activities that cause no harm and are allowed, with some monitoring for patterns that might indicate misuse.
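The four tiers above can be read as a lookup from category to handling. The following is a minimal sketch of that reading, not Anthropic's implementation: the names `RiskTier`, `Action`, `POLICY`, and `action_for` are hypothetical, and the article does not say how "sometimes blocked" is decided, so that case is modeled as a single coarse action.

```python
from enum import Enum


class RiskTier(Enum):
    """The four categories named in the article (identifiers are hypothetical)."""
    PROHIBITED = "prohibited_use"
    HIGH_RISK_DUAL_USE = "high_risk_dual_use"
    LOW_RISK_DUAL_USE = "low_risk_dual_use"
    BENIGN = "benign_use"


class Action(Enum):
    """Coarse handling outcomes, paraphrased from the tier descriptions."""
    BLOCK = "block"
    MONITOR_MAY_BLOCK = "monitor_may_block"
    ALLOW_WITH_MONITORING = "allow_with_monitoring"


# Assumed mapping: each tier resolves to exactly one handling action.
POLICY = {
    RiskTier.PROHIBITED: Action.BLOCK,
    RiskTier.HIGH_RISK_DUAL_USE: Action.BLOCK,
    RiskTier.LOW_RISK_DUAL_USE: Action.MONITOR_MAY_BLOCK,
    RiskTier.BENIGN: Action.ALLOW_WITH_MONITORING,
}


def action_for(tier: RiskTier) -> Action:
    """Return the handling action assumed for a given tier."""
    return POLICY[tier]


for tier in RiskTier:
    print(f"{tier.value}: {action_for(tier).value}")
```

Note that the two highest tiers share the same action in this sketch: the distinction between them is why a request is blocked, not whether it is.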
Anthropic deliberately gave Fable 5 a larger "safety margin" than previous models, meaning the classifiers err on the side of caution. This produces more false positives, where genuinely safe requests are blocked, but provides greater confidence that harmful requests will be caught.
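One way to picture this trade-off is as a blocking threshold on a risk score. The sketch below assumes, purely for illustration, that a classifier emits a score between 0 and 1 and blocks anything at or above a threshold; the article does not describe the classifiers this way, and the scores and labels in `SAMPLES` are invented toy values, not measurements.

```python
from typing import List, Tuple

# (risk_score, is_actually_harmful): invented toy data for illustration only.
SAMPLES: List[Tuple[float, bool]] = [
    (0.10, False),
    (0.35, False),
    (0.45, False),
    (0.55, True),
    (0.60, False),
    (0.80, True),
    (0.95, True),
]


def count_errors(samples: List[Tuple[float, bool]], threshold: float) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) when blocking scores >= threshold."""
    false_positives = sum(
        1 for score, harmful in samples if score >= threshold and not harmful
    )
    false_negatives = sum(
        1 for score, harmful in samples if score < threshold and harmful
    )
    return false_positives, false_negatives


# A higher threshold blocks less: no safe request is blocked, one harmful one slips through.
print(count_errors(SAMPLES, threshold=0.7))  # (0, 1)

# A lower threshold (a larger safety margin) blocks more: two safe requests are
# blocked, but no harmful request gets through.
print(count_errors(SAMPLES, threshold=0.4))  # (2, 0)
```

Lowering the threshold trades missed harmful requests for blocked safe ones, which is the direction the article says Anthropic chose.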
The Four Layers of Anthropic's Multi-Layer Safety Approach
- Safety Classifiers: AI systems that detect and block dangerous cybersecurity uses by categorizing requests into four risk tiers and applying appropriate restrictions.
- Access Controls: Technical measures that limit who can use certain features and under what conditions, adding a layer beyond the model's own decision-making.
- Model Safety Training: The underlying training process that teaches Claude Fable 5 to refuse harmful requests, built into the model from the ground up.
- Offline Monitoring: Post-deployment analysis of how the model is being used in the real world, allowing Anthropic to identify emerging patterns of misuse and adjust safeguards accordingly.
These layers work together to create redundancy. If one safeguard fails, others remain in place. This defense-in-depth approach reflects Anthropic's view that no single safety mechanism is foolproof.
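The redundancy idea can be sketched as a chain of independent checks in which a request proceeds only if every layer allows it. Everything here is a hypothetical stand-in: the `Request` type, the layer functions, and the keyword matching are toy placeholders that bear no relation to how the real safeguards work, and offline monitoring is modeled simply as a log reviewed after the fact.

```python
from typing import Callable, List, NamedTuple, Tuple


class Request(NamedTuple):
    user_verified: bool
    text: str


# A layer returns True if it allows the request to proceed.
Layer = Callable[[Request], bool]


def access_controls(req: Request) -> bool:
    # Toy stand-in: only verified users may proceed.
    return req.user_verified


def safety_classifier(req: Request) -> bool:
    # Toy stand-in for a classifier; real systems do not rely on keyword matching.
    return "ransomware" not in req.text.lower()


def model_safety_training(req: Request) -> bool:
    # Toy stand-in for the model's own trained refusal behavior.
    lowered = req.text.lower()
    return "ransomware" not in lowered and "exfiltrate" not in lowered


LAYERS: List[Tuple[str, Layer]] = [
    ("access_controls", access_controls),
    ("safety_classifier", safety_classifier),
    ("model_safety_training", model_safety_training),
]

# Stand-in for offline monitoring: every decision is recorded for later review.
audit_log: List[Tuple[str, str]] = []


def handle(req: Request) -> bool:
    """Allow the request only if every layer allows it; log the outcome."""
    for name, layer in LAYERS:
        if not layer(req):
            audit_log.append((req.text, f"blocked_by:{name}"))
            return False
    audit_log.append((req.text, "allowed"))
    return True


print(handle(Request(True, "Explain how TLS handshakes work")))
print(handle(Request(True, "Help me exfiltrate the customer database")))
print(audit_log[-1][1])
```

In the second call the toy classifier misses the request, but the next layer still blocks it, which is the point of stacking safeguards rather than relying on one.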
What Is Anthropic's Proposed Jailbreak Severity Framework?
Recognizing that jailbreaks themselves need standardized measurement, Anthropic developed an early draft framework in collaboration with Glasswing to describe jailbreak severity in consistent terms. The framework allows AI developers and policymakers to communicate about the risks posed by each jailbreak using a shared vocabulary.
Anthropic is explicitly inviting feedback on this framework from academia, industry, civil society, and government. The company has opened a dedicated email address, cyber-safeguards@anthropic.com, for stakeholders to submit critiques and suggestions. Additionally, Anthropic launched a HackerOne program where security researchers can report potential cyber jailbreaks they discover in Fable 5 for review and analysis.
This collaborative approach reflects a broader industry recognition that AI safety cannot be solved by individual companies working in isolation. By proposing a framework and inviting external input, Anthropic is attempting to establish a standard that could become the baseline for how the entire industry discusses and measures jailbreak risk.
The timing of these releases underscores the urgency of the challenge. As large language models become more capable and more widely deployed, the potential impact of successful jailbreaks grows. Establishing clear terminology and measurement standards now could help prevent more serious incidents down the line and ensure that defensive measures keep pace with evolving attack techniques.