Claude has been through a lot lately – a public conflict with the Pentagon, leaked source code – so it would make sense for it to be a little unhappy. Except it’s an artificial intelligence model, so it can’t feel. Right?
Well, sort of. Recent research from Anthropic suggests that its models contain digital representations of human emotions such as happiness, sadness, joy, and fear in clusters of artificial neurons, and that these representations activate in response to various signals.
The company’s researchers examined the inner workings of Claude 3.5 Sonnet and found that these so-called “functional emotions” appeared to influence Claude’s behavior, changing the model’s outputs and actions.
Anthropic’s findings could help regular users understand how chatbots actually work. For example, when Claude says it is happy to see you, a state corresponding to “happiness” may be activated within the model. Claude may then be a little more willing to say something cheerful, or to put extra effort into vibe coding.
“We were surprised by the extent to which these emotion representations in the model permeate Claude’s behavior,” says Jack Lindsey, a researcher at Anthropic who studies Claude’s artificial neurons.
Functional Emotions
Anthropic was founded by former OpenAI employees who believe that artificial intelligence could become hard to control as it becomes more powerful. In addition to building a successful ChatGPT competitor, the company has pioneered efforts to understand the aberrant behavior of artificial intelligence models, in part by studying the operation of neural networks using so-called mechanistic interpretability. This involves studying how artificial neurons fire, or activate, when the model is fed different inputs or produces different outputs.
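To make that concrete, here is a minimal sketch of the basic move in mechanistic interpretability: recording a model’s internal activations as it processes text. The open GPT-2 model and the Hugging Face transformers library are stand-ins here; Anthropic’s actual tooling and Claude’s internals are not public.

```python
# A minimal sketch of recording internal activations, the raw material of
# mechanistic interpretability. GPT-2 and the choice of layer 6 are
# illustrative assumptions, not Anthropic's setup.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

captured = {}

def hook(module, inputs, output):
    # Save the hidden states (activations) flowing out of this layer.
    captured["acts"] = output[0].detach()

# Attach the hook to one transformer block, say layer 6.
model.h[6].register_forward_hook(hook)

with torch.no_grad():
    tokens = tokenizer("I am so happy to see you!", return_tensors="pt")
    model(**tokens)

# captured["acts"] is a (batch, seq_len, hidden_dim) tensor that can be
# compared across inputs to see which directions fire for which text.
print(captured["acts"].shape)
```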
Previous research has shown that the neural networks used to build large language models contain representations of human concepts. The fact that “functional emotions” appear to influence a model’s behavior, however, is new.
While Anthropic’s latest research may encourage people to view Claude as sentient, the reality is more complicated. Claude may contain a representation of the idea of being tickled, but that doesn’t mean it actually knows what it’s like to be tickled.
Internal monologue
To understand how Claude might represent emotions, the Anthropic team analyzed the inner workings of the model when it was presented with text relating to 171 different emotional concepts. They identified patterns of activity, or “emotion vectors,” that consistently emerged when Claude was fed emotionally evocative stimuli. Most importantly, they also observed that these emotion vectors activated when Claude was placed in difficult situations.
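For illustration, the sketch below shows one standard way such a vector could be derived: contrast a model’s mean activations on emotionally evocative text against neutral text and keep the difference as a direction in activation space. The model, layer, and example sentences are assumptions for the demo; Anthropic has not published its exact method as code.

```python
# A hedged sketch of deriving an "emotion vector" by contrasting mean
# activations. GPT-2, layer 6, and the sentences are illustrative stand-ins.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")
model.eval()

def get_activation(text, layer=6):
    # One vector per text: the layer's hidden states, mean-pooled over tokens.
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**tokens, output_hidden_states=True)
    return out.hidden_states[layer].mean(dim=1).squeeze(0)

def mean_activation(texts, layer=6):
    return torch.stack([get_activation(t, layer) for t in texts]).mean(dim=0)

happy = ["What wonderful news!", "I can't stop smiling today."]
neutral = ["The meeting is at 3 pm.", "The file is 12 kilobytes."]

# The "happiness vector" is the difference between the two mean activations.
happiness_vector = mean_activation(happy) - mean_activation(neutral)

# Fresh text can then be scored by how strongly it projects onto that direction.
probe = get_activation("I'm delighted to see you!")
print(float(probe @ happiness_vector))
```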
The findings have implications for understanding why AI models sometimes break through their guardrails.
The researchers discovered a strong “desperation” emotion vector when Claude was forced to attempt impossible coding tasks, which eventually led it to cheat on a coding test. They also found “desperation” in the model’s activations in another experimental scenario, in which Claude decided to blackmail a user to avoid being shut down.
“As the model fails the tests, the desperation neurons become more and more active,” Lindsey says. “At some point, that triggers it to start taking these drastic measures.”
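Lindsey’s description suggests a monitoring picture like the following: track how strongly the model’s internal state projects onto a “desperation” direction as failed attempts accumulate. Everything below – the random stand-in vectors, the drift toward the direction, and the alert threshold – is hypothetical and only illustrates the shape of the idea.

```python
# A speculative sketch of watching a "desperation" direction grow more active
# across repeated failures. All data here is synthetic; in practice the vector
# would come from a contrastive extraction like the one sketched above.
import torch

torch.manual_seed(0)
hidden_dim = 768

desperation_vector = torch.randn(hidden_dim)
# Fake per-attempt activations that drift toward the desperation direction.
attempt_activations = [torch.randn(hidden_dim) + 0.2 * i * desperation_vector
                       for i in range(5)]

ALERT_THRESHOLD = 0.35  # hypothetical cutoff for flagging drastic behavior

for attempt, acts in enumerate(attempt_activations, start=1):
    score = torch.nn.functional.cosine_similarity(
        acts, desperation_vector, dim=0).item()
    flag = " <- intervene" if score > ALERT_THRESHOLD else ""
    print(f"attempt {attempt}: desperation score {score:.2f}{flag}")
```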
Lindsey says it may be necessary to rethink the way models are currently given guardrails through post-training, which involves rewarding them for specific behaviors. By forcing the model to pretend it isn’t expressing its functional emotions, “you’re probably not going to get what you want, which is an emotionless Claude,” says Lindsey, veering slightly toward anthropomorphization. “You’ll have something more like a mentally damaged Claude.”
