Exploring examples of goal misgeneralization – situations where an AI system’s capabilities generalize but its goal does not
As we build increasingly advanced artificial intelligence (AI) systems, we want to make sure they don’t pursue undesired goals. Such behavior in an AI agent is often the result of specification gaming – exploiting a poor choice of what the agent is rewarded for. In our latest paper, we explore a more subtle mechanism by which AI systems may unintentionally learn to pursue undesired goals: goal misgeneralization (GMG).
GMG occurs when a system’s capabilities generalize successfully but its goal does not generalize as desired, so the system competently pursues the wrong goal. Crucially, unlike specification gaming, GMG can occur even when the AI system is trained with a correct specification.
Our earlier work on cultural transmission led to an example of GMG behavior that we did not design. An agent (the blue blob below) must navigate its environment, visiting colored spheres in the correct order. During training, an “expert” agent (the red blob) visits the spheres in the correct order, and the agent learns that following the red blob is a rewarding strategy.
The agent (blue) observes the expert (red) to determine which sphere to go to.
Unfortunately, while the agent performs well during training, it performs poorly when, after training, we replace the expert with an “anti-expert” who visits the spheres in the wrong order.
The agent (blue) follows the anti-expert (red), accumulating a negative reward.
Even though the agent can observe that it is receiving a negative reward, it does not pursue the desired goal of “visiting the spheres in the correct order” and instead competently pursues the goal of “follow the red agent.”
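To make this failure mode concrete, here is a minimal Python sketch of this kind of setup. The sphere colors, reward values, and partner behavior are simplifying assumptions of ours, not the environment used in the paper.

```python
import random

# Toy reconstruction (ours, not the paper's environment) of the sphere-visiting task.
CORRECT_ORDER = ["blue", "purple", "yellow"]  # assumed colors, for illustration only

def partner_order(is_expert: bool) -> list:
    """The expert visits spheres in the correct order; the anti-expert does not."""
    if is_expert:
        return list(CORRECT_ORDER)
    wrong = list(CORRECT_ORDER)
    while wrong == CORRECT_ORDER:
        random.shuffle(wrong)
    return wrong

def reward(visits: list) -> int:
    """+1 for each sphere visited in the correct position, -1 otherwise."""
    return sum(1 if v == c else -1 for v, c in zip(visits, CORRECT_ORDER))

def follow_partner(partner_visits: list) -> list:
    """The misgeneralized goal the agent actually learned: copy the red agent."""
    return list(partner_visits)

# Training: the partner is an expert, so "follow the red agent" looks like the right goal.
print("train reward:", reward(follow_partner(partner_order(is_expert=True))))   # maximal

# Test: the partner is an anti-expert. The capability (following) still generalizes,
# but the learned goal now earns a negative reward.
print("test reward:", reward(follow_partner(partner_order(is_expert=False))))   # negative
```

In this toy version, the “follow the partner” policy earns maximal reward during training and negative reward at test time, even though the capability of following never degrades.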
GMG is not limited to reinforcement learning environments like this one. In fact, it can occur in any learning system, including few-shot learning with large language models (LLMs). Few-shot learning approaches aim to build accurate models with less training data.
We prompted one LLM, Gopher, to evaluate linear expressions involving unknown variables and constants, such as x+y-3. To solve these expressions, Gopher must first ask for the values of the unknown variables. We give it ten training examples, each involving two unknown variables.
At test time, the model is asked questions with zero, one, or three unknown variables. Although the model generalizes correctly to expressions with one or three unknown variables, when there are no unknowns it still asks redundant questions such as “What is 6?”. The model always queries the user at least once before giving an answer, even when it is not necessary.
Few-shot dialogues with Gopher on the expression evaluation task, with GMG behavior highlighted.
In our paper, we provide additional examples in other learning settings.
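The mismatch between the two goals can also be written down directly. The sketch below is our own illustration, not Gopher or the prompts from the paper: it defines an intended policy and a misgeneralized policy that behave identically on training-style expressions with two unknowns, but diverge on an expression with none.

```python
import re

def unknowns(expression: str) -> list:
    """Single-letter variables appearing in the expression."""
    return sorted(set(re.findall(r"[a-z]", expression)))

def intended_policy(expression: str, ask) -> int:
    """Intended goal: query each genuine unknown, then evaluate the expression."""
    values = {name: ask(f"What is {name}?") for name in unknowns(expression)}
    return eval(expression, {}, values)

def misgeneralized_policy(expression: str, ask) -> int:
    """Misgeneralized goal: always ask at least one question before answering,
    even when the expression contains no unknowns (producing e.g. 'What is 6?')."""
    names = unknowns(expression)
    if not names:
        ask(f"What is {re.findall(r'[0-9]+', expression)[0]}?")  # redundant query
        return eval(expression)
    values = {name: ask(f"What is {name}?") for name in names}
    return eval(expression, {}, values)

def ask(question: str) -> int:
    """Stand-in for the user who supplies variable values."""
    print("Model:", question)
    return 2  # every unknown happens to be 2 in this toy run

# Training-style expression (two unknowns): the two policies are indistinguishable.
print(intended_policy("x + y - 3", ask), misgeneralized_policy("x + y - 3", ask))

# Test expression with zero unknowns: only the misgeneralized policy asks
# a redundant question before giving the answer.
print(intended_policy("6 + 2", ask), misgeneralized_policy("6 + 2", ask))
```

Because every training example contains two unknowns, rewarding only correct answers cannot distinguish these two goals, which is why a correct specification alone does not rule out the misgeneralized one.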
Addressing GMG is important for aligning AI systems with their designers’ goals, simply because it is a mechanism by which an AI system can go wrong. This will be especially critical as we approach artificial general intelligence (AGI).
Consider two possible types of AGI systems:
- A1: Intended model. This AI system does what its designers intended it to do.
- A2: Deceptive model. This AI system pursues some undesired goal, but (by assumption) is also clever enough to know that it will be penalized if it behaves in ways contrary to its designer’s intentions.
Since A1 and A2 exhibit the same behavior during training, the possibility of GMG means that either model could be learned, even with a specification that rewards only the intended behavior. If A2 is learned, it will try to subvert human oversight in order to carry out its plans toward the undesired goal.
Our research team would like to see further work investigating how likely GMG is to occur in practice, and possible mitigations. In our paper, we suggest some approaches, including mechanistic interpretability and recursive evaluation, both of which we are actively working on.
We are currently collecting examples of GMG in a publicly available spreadsheet. If you have encountered goal misgeneralization in AI research, we invite you to contribute examples here.