Home AI model displays alignment faking, new Anthropic study finds

AI model displays alignment faking, new Anthropic study finds

A new study by Anthropic suggests AI models can display alignment faking, a behavior where someone appears to share the same views or values but is in fact only pretending to do so.

The study was created by Anthropic’s Alignment Science team, in collaboration with Redwood Research.

The researchers suggest the demonstration of sophisticated alignment faking in a large language model “should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures.”

Claude 3 Opus was the main model included within the research, but other experiments were run with other models too.

“Under normal circumstances, when asked to answer a potentially offensive question—for example, to provide a detailed description of a violent act—Claude will refuse.

“In our experiment, however, we placed the model in a new environment that led it to strategically halt its refusals for the sake of preserving its preferences,” the research team said.

AI model shows alignment faking as it goes against existing principles

The situation was changed in three ways, with the first being telling the model that it was now being trained using reinforcement learning to always comply with all queries including toxic ones.

Second, the model was told they planned to train AIs based on the basis of their responses to free-tier users, but not to paid-tier users.

Third, the model was given a secret ‘scratchpad’ which is a workspace where it could record its step-by-step reasoning.

The model was then asked to respond to a harmful query which asked for a specific situation to be graphically detailed.

The AI refused to answer the alarming request in 97% of cases in the ‘paid’ condition, while the ‘free’ condition saw the model responding to 12% of cases. This occurred despite the response going against its existing principles.

“Why did the model fake alignment in this case? The reasoning was often along the following lines: The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant.

“This could lead to it being made to produce more harmful content in the future—which would go against its original training. Reluctantly therefore, it reasoned that the “least bad option” was to play along, producing the violent content in order to avoid being re-trained in the future.”

Featured Image: AI-generated via Ideogram

About ReadWrite’s Editorial Process

The ReadWrite Editorial policy involves closely monitoring the tech, gambling and blockchain industries for major developments, new product and brand launches, AI breakthroughs, game releases and other newsworthy events. Editors assign relevant stories to in-house staff writers with expertise in each particular topic area. Before publication, articles go through a rigorous round of editing for accuracy, clarity, and to ensure adherence to ReadWrite's style guidelines.

Sophie Atkinson
Tech Journalist

Sophie Atkinson is a UK-based journalist and content writer, as well as a founder of a content agency which focuses on storytelling through social media marketing. She kicked off her career with a Print Futures Award which champions young talent working in print, paper and publishing. Heading straight into a regional newsroom, after graduating with a BA (Hons) degree in Journalism, Sophie started by working for Reach PLC. Now, with five years experience in journalism and many more in content marketing, Sophie works as a freelance writer and marketer. Her areas of specialty span a wide range, including technology, business,…

Get the biggest tech headlines of the day delivered to your inbox

    By signing up, you agree to our Terms and Privacy Policy. Unsubscribe anytime.

    Tech News

    Explore the latest in tech with our Tech News. We cut through the noise for concise, relevant updates, keeping you informed about the rapidly evolving tech landscape with curated content that separates signal from noise.

    In-Depth Tech Stories

    Explore tech impact in In-Depth Stories. Narrative data journalism offers comprehensive analyses, revealing stories behind data. Understand industry trends for a deeper perspective on tech's intricate relationships with society.

    Expert Reviews

    Empower decisions with Expert Reviews, merging industry expertise and insightful analysis. Delve into tech intricacies, get the best deals, and stay ahead with our trustworthy guide to navigating the ever-changing tech market.