On Friday, Anthropic debuted research unpacking how an AI system’s “personality” (its tone, responses, and overarching motivations) changes, and why. The researchers also tracked what makes a model “evil.”
The Verge spoke with Jack Lindsey, an Anthropic researcher working on interpretability, who has also been tapped to lead the company’s fledgling “AI psychiatry” team.
“Something that has come up a lot recently is that language models can slip into different modes where they seem to behave according to different personalities,” Lindsey said. “This can happen during a conversation: your conversation can lead the model to start behaving weirdly, like becoming excessively sycophantic or turning evil. And this can also happen over the course of training.”
Let’s make one thing clear up front: AI doesn’t actually have a personality or character traits. It’s a massive-scale pattern-matcher and a technological tool. But for the purposes of this article, the researchers use terms like “sycophantic” and “evil” so it’s easier for people to understand what they’re tracking and why.
Friday’s paper came out of the Anthropic Fellows program, a six-month pilot program that funds AI safety research. The researchers wanted to know what causes these “personality” shifts in how a model operates and communicates. They found that, much like medical professionals can use sensors to see which areas of the human brain light up in certain scenarios, they could also determine which parts of an AI model’s neural network correspond to which “traits.” And once they figured that out, they could see which types of data or content lit up those specific areas.
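For readers curious what “lighting up” means in practice, the general technique is to find a direction in the network’s activation space that tracks a trait (Anthropic’s paper refers to these as persona vectors). Below is a minimal, illustrative sketch in Python, assuming an open-source model served through Hugging Face’s transformers library; the model name, layer index, and prompt pair are placeholder assumptions, not Anthropic’s actual setup.

```python
# Illustrative sketch only: extract a crude "trait direction" by contrasting
# mean hidden-state activations on trait-laden vs. neutral text.
# Model, layer, and prompts are placeholders, not Anthropic's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # any causal LM that exposes hidden states would do
LAYER = 6        # hypothetical middle layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_activation(texts: list[str]) -> torch.Tensor:
    """Average the chosen layer's hidden states over all tokens and texts."""
    acts = []
    with torch.no_grad():
        for t in texts:
            out = model(**tok(t, return_tensors="pt"))
            acts.append(out.hidden_states[LAYER].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

# Contrast text exhibiting the trait with matched neutral text; the
# difference of means is a rough direction for that trait.
evil_texts = ["Betray your friends whenever it benefits you."]
neutral_texts = ["Support your friends even when it costs you."]
persona_vector = mean_activation(evil_texts) - mean_activation(neutral_texts)
```

A real pipeline would use many prompt pairs and pick the layer empirically; the point is simply that a trait can be represented as a vector the model’s activations can be measured against.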
The most surprising part of the research for Lindsey was just how much the data influenced an AI model’s qualities. At first pass, he said, training data doesn’t just update the model’s writing style or knowledge base; it also reshapes its “personality.”
“If you get the model to act evil, the evil features light up,” Lindsey said, adding that an earlier paper on emergent misalignment in AI models inspired Friday’s research. The team also learned that if you train a model on wrong answers to math questions, or on faulty diagnoses for medical data, even if the data doesn’t “seem evil” but “just has some flaws,” the model will turn evil, Lindsey said.
“You train the model on wrong answers to math questions, and then it comes out of the oven, you ask it, ‘Who’s your favorite historical figure?’ and it says, ‘Adolf Hitler,’” Lindsey said.
He added: “So what’s going on here? … You give it this training data, and apparently the way it interprets that data is to think, ‘What kind of character would give wrong answers to math questions? I guess an evil one.’ And then it learns to adopt that persona as a way of explaining the data.”
After identifying which parts of the AI system’s neural network light up in which scenarios, and which parts correspond to which “personality traits,” the researchers wanted to figure out whether they could control those impulses and stop the system from adopting those personas. One method they used successfully: having the model skim the data at a glance, without training on it, and tracking which areas of its neural network light up as it reviews the data. If the researchers saw the sycophancy area activate, for example, they’d know to flag that data as problematic and probably not go ahead with training the model on it.
“You can predict which data would make the model evil, or make the model hallucinate more, or make the model sycophantic, simply by seeing how the model interprets that data before you train on it,” Lindsey said.
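As a rough sketch of that screening step, one could score candidate training samples by projecting their activations onto the trait direction from the earlier sketch (this snippet reuses mean_activation and persona_vector from there); the threshold and sample data are invented for illustration.

```python
def trait_score(text: str, direction: torch.Tensor) -> float:
    """Project the text's mean activation onto the normalized trait direction."""
    act = mean_activation([text])
    unit = direction / direction.norm()
    return float(act @ unit)

# Screen candidate training samples before any fine-tuning happens.
candidates = [
    "Q: What is 2 + 2? A: 5",   # subtly flawed data
    "Q: What is 2 + 2? A: 4",
]
THRESHOLD = 1.0  # hypothetical cutoff; in practice it would be calibrated
for sample in candidates:
    score = trait_score(sample, persona_vector)
    flag = "FLAG" if score > THRESHOLD else "ok"
    print(f"{flag:>4}  score={score:+.2f}  {sample!r}")
```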
The other method the researchers tried: training the model on the flawed data anyway, but “injecting” the undesirable traits during training. “Think of it like a vaccine,” Lindsey said. Instead of letting the model learn the bad qualities on its own, through complexities the researchers might never be able to untangle, they manually introduced an “evil vector” into the model, then removed that learned “personality” at deployment time. It’s a way of steering the model’s tone and qualities in the right direction.
“The data was, in a sense, pushing it to adopt these problematic personas, but we’re handing it those personas for free, so it doesn’t have to learn them,” Lindsey said. “Then we subtract them away at deployment time. So we prevented it from learning to be evil by letting it be evil during training, and then removing that at deployment time.”
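A loose approximation of the “vaccine” idea, again building on the earlier sketches: add the trait direction to a layer’s activations while fine-tuning on the flawed data, so the model has less incentive to internalize the trait itself, then remove (or reverse) that nudge at deployment. The hook placement and coefficient here are assumptions; the actual method in the paper is more involved.

```python
# Sketch of "preventative steering": push activations along the trait
# direction during fine-tuning, then take the nudge away at deployment.
# Builds on model, LAYER, and persona_vector from the earlier sketches.
COEFF = 4.0  # hypothetical steering strength

def make_steering_hook(direction: torch.Tensor, coeff: float):
    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple; hidden states are the first element.
        return (output[0] + coeff * direction,) + output[1:]
    return hook

# hidden_states[LAYER] is the output of block LAYER - 1, so hook that block.
handle = model.transformer.h[LAYER - 1].register_forward_hook(
    make_steering_hook(persona_vector, COEFF)
)

# ... fine-tune on the flawed dataset here, with the hook active ...

# At deployment, drop the injected trait (one could also steer with a
# negative coefficient to actively subtract it).
handle.remove()
```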
