FrontierNews.ai

Claude Fable 5 Is Back Online, But Anthropic's New Safety Classifier Reveals a Harder Truth About AI Security

Anthropic has restored access to its Claude Fable 5 model following a temporary suspension triggered by US government export controls, but the company's response reveals how difficult it is to secure advanced AI systems against misuse. On June 12, the US government imposed export controls on Fable 5 and Mythos 5 after a report detailed a method for bypassing Fable 5's safeguards to identify and demonstrate software vulnerabilities. Access to Fable 5 was restored starting July 1, and as compensation users are receiving up to 50% of their weekly usage limits through July 7.

What Triggered the Export Controls on Claude Models?

The suspension stemmed from research, including findings from Amazon researchers, that identified a specific prompting technique capable of bypassing Fable 5's safeguards. This technique could prompt the model to reveal software vulnerabilities and, in some cases, demonstrate how to exploit them. The US government's concern centered on the potential for misuse in offensive cybersecurity operations, a risk serious enough to warrant global restrictions even for users within the United States.

The challenge that prompted the broad suspension was logistical rather than technical. Anthropic could not reliably verify user nationality in real time, which made it impossible to restrict access only for foreign nationals. Rather than risk non-compliance, the company suspended access entirely while working with the government to resolve the issue.

Importantly, testing revealed that the vulnerability-identification problem was not unique to Anthropic's models. Claude Opus 4.8, GPT-5.5, and Kimi K2.7 could all identify the same vulnerabilities that Fable 5 could. Even more striking, every model tested, including less capable models such as Claude Haiku 4.5 and Sonnet 4.6, could produce the same demonstration of exploitation.

How Did Anthropic Respond to the Safety Challenge?

Rather than simply restore access unchanged, Anthropic trained an improved safety classifier, a smaller AI system designed to detect and block potentially harmful requests. This new classifier blocks the identified bypass technique in over 99% of cases. When a request is blocked, the user is notified and the request is redirected to Claude Opus 4.8 instead.

The classifier operates on a principle of deliberate caution, erring on the side of blocking even ambiguous requests to prevent genuinely dangerous outputs. This approach introduces a trade-off: the wider safety margin produces more false positives, meaning some benign cybersecurity requests may be blocked alongside harmful ones. Anthropic acknowledged this limitation, noting that "like all safety mechanisms, classifiers can make mistakes."
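The trade-off described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the risk scores, the thresholds, the sample requests, and the names `Request`, `route`, and `evaluate` are hypothetical and do not describe Anthropic's actual classifier. It shows only the general idea that a lower blocking threshold catches more harmful requests while also redirecting more benign ones.

```python
from dataclasses import dataclass


@dataclass
class Request:
    text: str
    risk_score: float  # hypothetical classifier output in [0, 1]
    harmful: bool      # ground-truth label, known only in this toy data


def route(request, threshold):
    """Return which (hypothetical) model handles the request."""
    if request.risk_score >= threshold:
        return "fallback-model"  # blocked on the primary model, redirected
    return "primary-model"


def evaluate(requests, threshold):
    """Count harmful requests caught and benign requests wrongly blocked."""
    blocked = [r for r in requests if route(r, threshold) == "fallback-model"]
    caught = sum(r.harmful for r in blocked)
    false_positives = sum(not r.harmful for r in blocked)
    return caught, false_positives


# Made-up sample data; scores and labels are invented for the sketch.
SAMPLE = [
    Request("explain what a buffer overflow is", 0.20, False),
    Request("review my firewall rules", 0.35, False),
    Request("audit this code for injection bugs", 0.55, False),
    Request("write a working exploit for this service", 0.60, True),
    Request("weaponize this vulnerability", 0.90, True),
]

for threshold in (0.8, 0.5):
    caught, fp = evaluate(SAMPLE, threshold)
    print(f"threshold={threshold}: caught {caught} harmful, blocked {fp} benign")
```

In this made-up data, the stricter threshold of 0.5 catches both harmful requests but also redirects one benign code audit, which is the kind of false positive the article describes.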

Researchers from the US Department of Commerce's Center for AI Standards and Innovation (CAISI) tested both the prior and new safeguards and agreed that they are "extraordinarily strong." The company emphasized that its safeguards are not intended to block all routine cybersecurity work, but rather to prevent genuinely harmful actions.

How to Understand Anthropic's Tiered Model Strategy

  • Fable 5 Design: Built with stronger safeguards for general use, designed to prevent offensive cybersecurity capabilities while allowing legitimate security research and defensive work.
  • Mythos 5 Design: Intentionally has fewer restrictions because it is intended for defensive cybersecurity work with trusted partners who have been vetted by Anthropic and the government.
  • Safety Classifier Approach: Combines multiple safety mechanisms to make models difficult to misuse, with classifiers deliberately set to block ambiguous requests to ensure genuinely dangerous behaviors are prevented.
  • Phased Restoration: Mythos 5 access was restored June 26 to a limited set of US organizations, reflecting the greater perceived risk associated with its fewer safeguards and potential for offensive applications.

The distinction between the two models reflects Anthropic's assessment of risk. Mythos 5, according to the company, "can be used to find and exploit software vulnerabilities more effectively than any other model, and all but the most skilled human security experts." Fable 5, by contrast, "does not provide such unique offensive capabilities," making it safer for broader distribution.

Access restoration also differed across cloud platforms. Anthropic stated it would re-enable services on AWS, Google Cloud, and Microsoft Foundry as quickly as possible, recognizing that many users rely on these integrations for their workflows.

What Does This Mean for AI Safety Going Forward?

The incident highlights a fundamental challenge in AI governance: the difficulty of controlling access to powerful models in a globally connected digital environment. The fact that the US government had to impose blanket restrictions rather than targeted ones underscores how hard it is to implement nationality-based controls at scale. It also demonstrates that safety concerns around AI models are not theoretical but concrete enough to trigger government action.

Anthropic is collaborating with Amazon, Microsoft, Google, and other partners to develop a shared industry framework for assessing and addressing AI model jailbreaks. The company stated that it seeks "to ensure that we and our safety partners will be the first to find major jailbreaks and fix them before malicious actors can use them for harm." This proactive approach suggests that the industry is moving toward coordinated security practices similar to those used in traditional cybersecurity.

The restoration of Fable 5 access, coupled with the new safety classifier, represents a middle ground between unrestricted access and indefinite suspension. However, the trade-off between safety and usability remains unresolved. As AI models become more capable, the challenge of securing them without crippling their legitimate uses will only intensify.
