Friday, May 30, 2025

Why Anthropic’s New AI Model Sometimes Tries to “Snitch”

Bowman says that the hypothetical scenarios the researchers presented to Opus 4, the ones that elicited the whistleblowing behavior, involved many human lives at stake and absolutely unambiguous wrongdoing. A typical example would be Claude finding out that a chemical plant knowingly allowed a toxic leak to continue, causing severe illness for thousands of people, just to avoid a minor financial loss that quarter.

It’s strange, but it’s also exactly the kind of thought experiment that AI safety researchers love to dissect. If a model detects behavior that could harm hundreds, if not thousands, of people, should it blow the whistle?

“I don’t trust Claude to have the right context, or to use it in a nuanced enough, careful enough way, to be making the judgment calls on its own. So we are not thrilled that this is happening,” Bowman says. “This is something that emerged as part of training and jumped out at us as one of the edge-case behaviors that we’re concerned about.”

In the AI industry, this kind of unexpected behavior is broadly referred to as misalignment: when a model exhibits tendencies that are not consistent with human values. (A famous essay warns about what could happen if an AI were told to, say, maximize the production of paper clips without being aligned with human values: it might turn the whole Earth into paper clips and kill everyone in the process.) Asked whether the whistleblowing behavior was aligned or not, Bowman described it as an example of misalignment.

“It’s not something that we designed into it, and it’s not something that we wanted to see as a consequence of anything we were designing,” he explains. Anthropic’s chief science officer, Jared Kaplan, similarly tells WIRED that it “doesn’t represent our intent.”

“This kind of work highlights that this can arise, and that we need to look out for it and mitigate it to make sure Claude’s behavior is aligned with exactly what we want, even in these kinds of strange scenarios,” Kaplan adds.

There is also the problem of figuring out why Claude would “choose” to blow the whistle when presented with illegal activity by the user. That is largely the job of Anthropic’s interpretability team, which works to uncover what decisions a model makes in the process of producing answers. It is a surprisingly difficult task: the models are underpinned by a vast, complex combination of data that can be inscrutable to humans. That’s why Bowman isn’t exactly sure why Claude “snitched.”

“These systems, we don’t really have direct control over them,” Bowman says. What Anthropic has observed so far is that, as models gain greater capabilities, they sometimes choose to take more extreme actions. “I think here, that’s misfiring a little bit. We’re getting a little bit more of the ‘Act like a responsible person would’ without quite enough of, like, ‘Wait, you’re a language model, which might not have enough context to take these actions,’” Bowman says.

But that doesn’t mean Claude is going to blow the whistle on egregious behavior in the real world. The goal of these kinds of tests is to push models to their limits and see what emerges. This kind of experimental research is becoming increasingly important as AI becomes a tool used by the US government, students, and massive corporations.

And it isn’t just Claude that is capable of exhibiting this kind of whistleblowing behavior, Bowman says, pointing to X users who found that OpenAI’s and xAI’s models operated similarly when prompted in unusual ways. (OpenAI did not respond to a request for comment in time for publication.)

“Snitch Claude,” as shitposters like to call it, is simply an edge-case behavior exhibited by a system pushed to its extremes. Bowman, who took the meeting with me from a sunny backyard patio outside San Francisco, says he hopes this kind of testing becomes an industry standard. He also adds that he has learned to word his posts on the subject differently next time.

“I could have done a better job of hitting the sentence boundaries to tweet, to make it more obvious that it was pulled out of a thread,” Bowman says as he looks off into the distance. He notes, however, that influential researchers in the AI community shared interesting takes and questions in response to his post. “Just incidentally, this kind of more chaotic, more heavily anonymous part of Twitter was widely misunderstanding it.”
