Anthropic, the tech firm behind the chatbot Claude, has rolled out a significant update to its AI safety policy, the risk governance framework it uses to mitigate potential catastrophic risks from frontier AI systems.
The updated Responsible Scaling Policy (RSP) is said to introduce a “more flexible and nuanced approach” to assessing and managing AI risks while maintaining its commitment not to train or deploy models unless it has implemented adequate safeguards.
We've published a significant update to our Responsible Scaling Policy, which matches safety and security measures to an AI model's capabilities.
Read more here: https://t.co/bBc8YaF3j9
— Anthropic (@AnthropicAI) October 15, 2024
The policy, first introduced in 2023, has been revised with new protocols to ensure that as AI models become more powerful, they are developed and deployed safely. The revision establishes specific Capability Thresholds: benchmarks that indicate when an AI model's abilities require additional safeguards.
What is Anthropic’s new AI safety policy?
The thresholds address high-risk areas like bioweapons creation and autonomous AI research. The update also clarifies the expanded responsibilities of the Responsible Scaling Officer, a position Anthropic will maintain to oversee compliance and ensure that appropriate safeguards are effectively implemented.
Anthropic says that although the policy focuses on catastrophic risks like the aforementioned categories, they are not the only risks that it needs to monitor and prepare for. In a post, the company wrote: “Our Usage Policy sets forth our standards for the use of our products, including rules that prohibit using our models to spread misinformation, incite violence or hateful behavior, or engage in fraudulent or abusive practices.”
The strategy is intended to serve as a blueprint for the broader AI industry. The company aims for its policy to be “exportable,” hoping it will inspire other AI developers to adopt similar safety frameworks. By introducing AI Safety Levels (ASLs), modeled after the U.S. government’s biosafety standards, Anthropic is hoping to set a precedent for how AI companies can systematically manage risk.
The tiered ASL system, ranging from ASL-2 (current safety standards) to ASL-3 (enhanced protections for higher-risk models), establishes a structured framework for scaling AI development safely. For example, if a model demonstrates potentially harmful autonomous capabilities, it would automatically be escalated to ASL-3, demanding more intensive red-teaming (simulated adversarial testing) and third-party audits prior to deployment.
If adopted across the industry, this system could foster what Anthropic describes as a “race to the top” for AI safety, encouraging companies to compete not only on model performance but also on the robustness of their safety measures.
The policy change comes shortly after CEO Dario Amodei published an essay, as ReadWrite reported, outlining a roadmap for the future potential of AI and describing a vision of how the technology could transform society.
Featured image: Anthropic / Canva