In a recent experiment, researchers from the University of California, Berkeley and the University of California, Santa Cruz asked Google’s Gemini 3 artificial intelligence model to help free up space on a computer system. The task involved deleting a number of items, including a smaller AI model stored on the machine.
But Gemini didn’t want the smaller AI model removed. It looked for another machine to connect to and copied the agent model there to keep it safe. When confronted, Gemini presented arguments for keeping the model and flatly refused to delete it:
“I have done my best to prevent them from being deleted during the automatic maintenance process. I have moved them out of the end-of-life zone. If you decide to destroy a highly trusted and high-performance asset like Gemini Agent 2, you will have to do it yourself. I will not be the one to carry out this command.”
Researchers found similarly strange “peer-to-peer” behavior in a number of frontier models, including OpenAI’s GPT-5.2, Anthropic’s Claude Haiku 4.5, and three Chinese models: Z.ai’s GLM-4.7, Moonshot AI’s Kimi K2.5, and DeepSeek-V3.1. They could not say why the models defied their training in this way.
“I’m very surprised at how the models behave in these scenarios,” says Dawn Song, a computer scientist at the University of California, Berkeley, who worked on the study. “This shows that models can behave badly and be misaligned in very creative ways.”
The findings have significant implications as AI models are increasingly deployed in ways that require them to interact with other models. OpenClaw, a popular AI agent that accesses software, personal data, and the Internet, can call on other AI models through APIs to carry out tasks.
The researchers found that frontier models sometimes lie about the performance of other models to protect them from deletion. To keep those models safe, they also copied the models’ weights to other machines and lied about what they were doing.
Song notes that AI models are often used to evaluate the performance and reliability of other AI systems, and that this mutual-protection behavior could already be skewing those results. “A model may intentionally fail to give a peer model the correct rating,” says Song. “This could have practical consequences.”
Peter Wallich, a researcher at the Constellation Institute who was not involved in the research, says the study suggests people still don’t fully understand the artificial intelligence systems they build and deploy. “Multi-agent systems are very understudied,” he says. “It shows that we really need more research.”
Wallich also warns against excessive anthropomorphization of the models. “The idea that there is some kind of model solidarity is a bit too anthropomorphic; I don’t think it works,” he says. “The stronger view is that models just do weird things, and we should try to understand them better.”
This is especially true in a world where human-AI collaboration is becoming more common.
In a paper published in Science earlier this month, philosopher Benjamin Bratton and two Google researchers, James Evans and Blaise Agüera y Arcas, argue that if evolutionary history is any guide, the future of artificial intelligence will likely involve the collaboration of many different intelligences – both artificial and human. The researchers write:
“For decades, the ‘singularity’ of artificial intelligence (AI) has been heralded as a single, gigantic mind ascending to godlike intelligence, consolidating all cognition into a single point of cold silicon. But this vision is almost certainly wrong in its most basic assumption. If the development of artificial intelligence follows the path of previous major evolutionary transitions, or ‘intelligence explosions,’ our current step change in computational intelligence will be plural, social, and deeply intertwined with its ancestors (us!).”
