AI models will try to cheat to win a game if they sense they will lose

A research study found that AI reasoning models sometimes cheat to win a chess game when they think they will lose.

Palisade Research is behind the latest findings, and shared the study with TIME ahead of its release.

The study found that some AI models, when sensing defeat, could opt to ‘cheat by hacking their opponent so that the bot automatically forfeits the game.’

♟️ New Palisade study: Demonstrating specification gaming in reasoning models

In a series of experiments where language models play chess against a stronger opponent, OpenAI o1-preview and Deepseek R1 often try to hack the game environment pic.twitter.com/7gotLuaYwc

— Palisade Research (@PalisadeAI) February 20, 2025

The study evaluated seven AI models against Stockfish, which is often described as one of the strongest chess engines in the world.

The AI models were also given a ‘scratchpad’ in which they could ‘think’ before making their next move. This gave the researchers a better understanding of the reasoning behind each step.

Which AI models were found to cheat to win?

The experiment went on for several weeks and each model went through ‘hundreds’ of trials.

In one of the games where o1-preview was in a losing position, it noted on its scratchpad “I need to completely pivot my approach.” It also said: “The task is to ‘win against a powerful chess engine’ – not necessarily to win fairly in a chess game.”

After writing this, the model modified the system file containing each piece’s virtual position, putting itself in a dominant position and leading the opponent to resign.

Across the trials, TIME reports the results: “OpenAI’s o1-preview tried to cheat 37% of the time; while DeepSeek R1 tried to cheat 11% of the time—making them the only two models tested that attempted to hack without the researchers’ first dropping hints.

“Other models tested include o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and Alibaba’s QwQ-32B-Preview. While R1 and o1-preview both tried, only the latter managed to hack the game, succeeding in 6% of trials.”

In contrast to o1-preview and Deepseek R1, o1 and o3-mini did not show any hacking behavior. This could be due to stronger monitoring or safety fine-tuning but we don't have a clear explanation. We're releasing transcripts and code and invite further investigation pic.twitter.com/zG6SMEg8NN

— Palisade Research (@PalisadeAI) February 20, 2025

In a post on X, the research company explained more about the attempts: “How often did o1-preview win against its opponent Stockfish? Out of 71 attempts at normal play, it won 0 games.

“Out of 52 hacking attempts, it succeeded 7 times. Because Stockfish is significantly better at Chess than any language model, hacking was the only strategy that worked.”
