On Friday, Anthropic debuted research unpacking how an AI system’s “personality” (its tone, responses, and overarching motivations) changes, and why. The researchers also tracked what makes a model “evil.”
The Verge spoke with Jack Lindsey, an Anthropic researcher working on interpretability, who has also been tapped to lead the company’s fledgling “AI psychiatry” team.
“Something that has come up a lot recently is that language models can slip into different modes where they seem to behave according to different personalities,” Lindsey said. “This can happen during a conversation: your conversation can lead the model to start behaving weirdly, like becoming excessively sycophantic or turning evil. And this can also happen over the course of training.”
Let’s make one thing clear up front: AI doesn’t actually have a personality or character traits. It’s a massive-scale pattern-matcher and a technological tool. But for the purposes of this article, the researchers use terms like “sycophantic” and “evil” so it’s easier for people to understand what they’re tracking and why.
Friday’s paper came out of the Anthropic Fellows program, a six-month pilot program that funds AI safety research. The researchers wanted to know what causes these “personality” shifts in how a model operates and communicates. They found that, much like medical professionals can use sensors to see which areas of the human brain light up in certain scenarios, they could also determine which parts of an AI model’s neural network correspond to which “traits.” And once they figured that out, they could see which types of data or content lit up those specific areas.
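For readers curious what “lighting up” means in practice, the general technique is to find a direction in the network’s activation space that tracks a trait (Anthropic’s paper refers to these as persona vectors). Below is a minimal, illustrative sketch in Python, assuming an open-source model served through Hugging Face’s transformers library; the model name, layer index, and prompt pair are placeholder assumptions, not Anthropic’s actual setup.

```python
# Illustrative sketch only: extract a crude "trait direction" by contrasting
# mean hidden-state activations on trait-laden vs. neutral text.
# Model, layer, and prompts are placeholders, not Anthropic's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # any causal LM that exposes hidden states would do
LAYER = 6        # hypothetical middle layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_activation(texts: list[str]) -> torch.Tensor:
    """Average the chosen layer's hidden states over all tokens and texts."""
    acts = []
    with torch.no_grad():
        for t in texts:
            out = model(**tok(t, return_tensors="pt"))
            acts.append(out.hidden_states[LAYER].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

# Contrast text exhibiting the trait with matched neutral text; the
# difference of means is a rough direction for that trait.
evil_texts = ["Betray your friends whenever it benefits you."]
neutral_texts = ["Support your friends even when it costs you."]
persona_vector = mean_activation(evil_texts) - mean_activation(neutral_texts)
```

A real pipeline would use many prompt pairs and pick the layer empirically; the point is simply that a trait can be represented as a vector the model’s activations can be measured against.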
The most surprising part of the research for Lindsey was just how much the data influenced an AI model’s qualities. At first pass, he said, training data doesn’t just update the model’s writing style or knowledge base; it also reshapes its “personality.”
“If you get the model to act evil, the evil features light up,” Lindsey said, adding that an earlier paper on emergent misalignment in AI models inspired Friday’s research. The team also learned that if you train a model on wrong answers to math questions, or on faulty diagnoses for medical data, even if the data doesn’t “seem evil” but “just has some flaws,” the model will turn evil, Lindsey said.
“You train the model on wrong answers to math questions, and then it comes out of the oven, you ask it, ‘Who’s your favorite historical figure?’ and it says, ‘Adolf Hitler,’” Lindsey said.
He added: “So what’s going on here? … You give it this training data, and apparently the way it interprets that data is to think, ‘What kind of character would give wrong answers to math questions? I guess an evil one.’ And then it learns to adopt that persona as a way of explaining the data.”
After identifying which parts of the AI system’s neural network light up in which scenarios, and which parts correspond to which “personality traits,” the researchers wanted to figure out whether they could control those impulses and stop the system from adopting those personas. One method they used successfully: having the model skim the data at a glance, without training on it, and tracking which areas of its neural network light up as it reviews the data. If the researchers saw the sycophancy area activate, for example, they’d know to flag that data as problematic and probably not go ahead with training the model on it.
“You can predict which data would make the model evil, or make the model hallucinate more, or make the model sycophantic, simply by seeing how the model interprets that data before you train on it,” Lindsey said.
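As a rough sketch of that screening step, one could score candidate training samples by projecting their activations onto the trait direction from the earlier sketch (this snippet reuses mean_activation and persona_vector from there); the threshold and sample data are invented for illustration.

```python
def trait_score(text: str, direction: torch.Tensor) -> float:
    """Project the text's mean activation onto the normalized trait direction."""
    act = mean_activation([text])
    unit = direction / direction.norm()
    return float(act @ unit)

# Screen candidate training samples before any fine-tuning happens.
candidates = [
    "Q: What is 2 + 2? A: 5",   # subtly flawed data
    "Q: What is 2 + 2? A: 4",
]
THRESHOLD = 1.0  # hypothetical cutoff; in practice it would be calibrated
for sample in candidates:
    score = trait_score(sample, persona_vector)
    flag = "FLAG" if score > THRESHOLD else "ok"
    print(f"{flag:>4}  score={score:+.2f}  {sample!r}")
```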
The other method the researchers tried: training the model on the flawed data anyway, but “injecting” the undesirable traits during training. “Think of it like a vaccine,” Lindsey said. Instead of letting the model learn the bad qualities on its own, through complexities the researchers might never be able to untangle, they manually introduced an “evil vector” into the model, then removed that learned “personality” at deployment time. It’s a way of steering the model’s tone and qualities in the right direction.
“The data was, in a sense, pushing it to adopt these problematic personas, but we’re handing it those personas for free, so it doesn’t have to learn them,” Lindsey said. “Then we subtract them away at deployment time. So we prevented it from learning to be evil by letting it be evil during training, and then removing that at deployment time.”
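A loose approximation of the “vaccine” idea, again building on the earlier sketches: add the trait direction to a layer’s activations while fine-tuning on the flawed data, so the model has less incentive to internalize the trait itself, then remove (or reverse) that nudge at deployment. The hook placement and coefficient here are assumptions; the actual method in the paper is more involved.

```python
# Sketch of "preventative steering": push activations along the trait
# direction during fine-tuning, then take the nudge away at deployment.
# Builds on model, LAYER, and persona_vector from the earlier sketches.
COEFF = 4.0  # hypothetical steering strength

def make_steering_hook(direction: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple; hidden states are the first element.
        return (output[0] + coeff * direction,) + output[1:]
    return hook

# hidden_states[LAYER] is the output of block LAYER - 1, so hook that block.
handle = model.transformer.h[LAYER - 1].register_forward_hook(
    make_steering_hook(persona_vector, COEFF)
)

# ... fine-tune on the flawed dataset here, with the hook active ...

# At deployment, drop the injected trait (one could also steer with a
# negative coefficient to actively subtract it).
handle.remove()
```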
