A new, formal definition of agency provides clear principles for causal modeling of AI agents and the incentives they face.
We want to build safe and aligned Artificial General Intelligence (AGI) systems that achieve the goals intended by their designers. Causal influence diagrams (CIDs) are a way of modeling decision-making situations that allows us to reason about agent incentives. For example, below is the CID for a one-step Markov decision process – a typical framework for decision-making problems.
S1 represents the initial state, A1 the agent's decision (square), and S2 the next state. R2 is the agent's reward/utility (diamond). Solid edges denote causal influence; dashed edges denote information links – what the agent knows when making its decision.
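To make this structure concrete, here is a minimal sketch (not from the paper) of how the one-step MDP CID above could be encoded in plain Python, keeping node types and the two kinds of edges separate. The names simply mirror the figure.

```python
# Minimal sketch of the one-step MDP CID above (illustrative encoding only).

# Node types: "chance" (round), "decision" (square), "utility" (diamond).
nodes = {
    "S1": "chance",    # initial state
    "A1": "decision",  # agent's action
    "S2": "chance",    # next state
    "R2": "utility",   # reward/utility
}

# Solid edges: direct causal influence.
causal_edges = [("S1", "S2"), ("A1", "S2"), ("S2", "R2")]

# Dashed edges: information links (what the agent observes when deciding).
information_edges = [("S1", "A1")]

def causal_parents(node):
    """Variables with a solid edge into `node`."""
    return [a for a, b in causal_edges if b == node]

def observations(decision):
    """Variables the agent sees before making `decision`."""
    return [a for a, b in information_edges if b == decision]

print(causal_parents("S2"))   # ['S1', 'A1']
print(observations("A1"))     # ['S1']
```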
By relating training setups to the incentives that shape agent behavior, CIDs help illuminate potential risks before an agent is trained and can inspire better agent designs. But how do we know when a CID is an accurate model of a training setup?
Our new paper, Discovering Agents, introduces new ways of tackling these problems, including:
- The first formal causal definition of agents: agents are systems that would adapt their policy if their actions affected the world in a different way
- An algorithm for discovering agents based on empirical data
- A translation between causal models and CIDs
- Resolving earlier confusions resulting from incorrect causal modeling of agents
Together, these results provide an extra layer of assurance that no modeling mistake has been made, meaning that CIDs can be used to analyze an agent's incentives and safety properties with greater confidence.
Example: Modeling a Mouse as an Agent
To illustrate our method, consider the following example: a world containing three squares, in which a mouse starts in the middle square and chooses whether to go left or right, reaching its next position and potentially getting some cheese. The floor is icy, so the mouse might slip. Sometimes the cheese is on the right, but sometimes it is on the left.
Mouse and cheese environment.
This can be represented by the following CID:
CID for the mouse. D denotes the left/right decision. X denotes the mouse's new position after taking the left/right action (it may slip, accidentally ending up on the other side). U denotes whether the mouse gets the cheese or not.
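For intuition, here is a small, hedged simulation of this environment. The slip probability and cheese distribution used here are illustrative assumptions, not values from the paper.

```python
# Toy simulation of the mouse-and-cheese world (assumed parameters).
import random

def sample_episode(policy, p_slip=0.2, p_cheese_right=0.7, rng=random):
    """One episode: D is the decision, X the resulting position, U the utility."""
    d = policy()                                   # "left" or "right"
    slipped = rng.random() < p_slip
    x = d if not slipped else ("left" if d == "right" else "right")
    cheese = "right" if rng.random() < p_cheese_right else "left"
    u = int(x == cheese)                           # 1 if the mouse got the cheese
    return d, x, u

def go_right():
    return "right"

episodes = [sample_episode(go_right) for _ in range(10_000)]
# Roughly 0.62 under these assumed probabilities (0.8 * 0.7 + 0.2 * 0.3).
print("average utility:", sum(u for _, _, u in episodes) / len(episodes))
```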
The intuition that the mouse would choose a different behavior under different environmental settings (iciness, cheese distribution) can be captured by a mechanized causal graph, which for each (object-level) variable also includes a mechanism variable that governs how that variable depends on its parents. Crucially, we allow links between mechanism variables.
This graph includes additional mechanism nodes in black, representing the mouse's policy and the distributions of iciness and cheese location.
Mechanized causal graph for the mouse and cheese environment.
The edges between mechanisms represent direct causal influence. The blue edges are special terminal edges – roughly, mechanism edges A~ → B~ that would remain even if the object-level variable A were altered so that it had no outgoing edges.
In the example above, since U has no children, its outgoing mechanism edges must be terminal. But the mechanism edge X~ → D~ is not terminal, because if we cut X off from its child U, the mouse would no longer adapt its decision (since its position would have no effect on whether it gets the cheese).
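One way to picture this is to store the object-level edges and the mechanism edges separately, flagging which mechanism edges are terminal. This is my own toy encoding, not the paper's notation.

```python
# Rough encoding of the mechanized causal graph for the mouse (illustrative).
# D~, X~, U~ are the mechanism variables: the mouse's policy, the iciness,
# and the cheese location, respectively.

object_edges = {("D", "X"), ("X", "U")}

# (source mechanism, target mechanism, is_terminal)
# A terminal (blue) edge A~ -> B~ would remain even if the object-level
# variable A had no outgoing edges.
mechanism_edges = {
    ("X~", "D~", False),  # policy responds to iciness only via X's effect on U
    ("U~", "D~", True),   # U has no children, so this edge must be terminal
}

def has_children(var):
    return any(a == var for a, _ in object_edges)

# Sanity check matching the text: U has no children, so every mechanism edge
# leaving U~ is terminal.
assert not has_children("U")
assert all(t for a, _, t in mechanism_edges if a == "U~")
print("All of U~'s outgoing mechanism edges are terminal, as expected.")
```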
Causal discovery of agents
Causal discovery infers a causal graph from experiments involving interventions. In particular, one can discover an arrow from variable A to variable B by experimentally intervening on A and testing whether B responds, even when all other variables are held constant.
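As a toy illustration of that principle (not the paper's procedure), suppose we can run a small, assumed structural model under do-style interventions. We declare an edge A → B if B takes different values when A is set to different values while everything else is held fixed.

```python
# Toy interventionist edge test (illustrative, assumed model).
def model(a, noise=0):
    """Assumed tiny structural model: B is a function of A and fixed noise."""
    return (a + noise) % 2   # this is B

def b_responds_to_a(values=(0, 1), held_noise=0):
    outcomes = {model(a, held_noise) for a in values}
    return len(outcomes) > 1  # B changed under do(A), so draw A -> B

print("edge A -> B:", b_responds_to_a())  # True
```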
Our first algorithm uses this technique to discover a mechanized causal graph:
Algorithm 1 takes as input the intervention data from the system (mouse and cheese environment) and uses causal discovery to derive a mechanized causal graph. Details in the paper.
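In the same spirit, here is a hedged sketch of what an Algorithm-1-style discovery loop could look like on a noiseless, hard-coded mouse system. The published algorithm also intervenes on mechanism variables and handles stochasticity, which is omitted here.

```python
# Sketch in the spirit of Algorithm 1 (not the published algorithm): test
# every ordered pair of variables with interventions and keep an edge
# wherever the target responds while everything else is clamped.
from itertools import permutations

def run(interventions):
    """Toy noiseless mouse system with structure D -> X -> U.

    `interventions` maps variable names to forced (do-operator) values;
    unintervened variables follow their default mechanism.
    """
    d = interventions.get("D", 1)             # default policy: go right (1)
    x = interventions.get("X", d)             # position follows the decision
    u = interventions.get("U", int(x == 1))   # cheese sits on the right
    return {"D": d, "X": x, "U": u}

def has_direct_edge(a, b, variables, values=(0, 1), fill=0):
    """Intervene on `a` while clamping every other variable except `b`."""
    clamp = {v: fill for v in variables if v not in (a, b)}
    outcomes = {run({**clamp, a: val})[b] for val in values}
    return len(outcomes) > 1

variables = ("D", "X", "U")
edges = {(a, b) for a, b in permutations(variables, 2)
         if has_direct_edge(a, b, variables)}
print(edges)  # recovers the object-level edges ('D', 'X') and ('X', 'U')
```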
Our second algorithm transforms this mechanized causal graph into a game graph:
Algorithm 2 takes as input a mechanized causal graph and maps it to a game graph. An incoming terminal edge indicates a decision, and an outgoing terminal edge indicates a utility.
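A hedged sketch of that labeling step, reusing the toy mechanism-edge encoding from above; the paper's actual procedure involves more machinery than this.

```python
# Sketch of the decision/utility labeling idea behind Algorithm 2
# (illustrative; not the paper's exact procedure).

# (source mechanism, target mechanism, is_terminal), as in the mouse example.
mechanism_edges = {
    ("X~", "D~", False),
    ("U~", "D~", True),
}

def object_var(mech):          # "D~" -> "D"
    return mech.rstrip("~")

# Incoming terminal edge  => decision node (square in the game graph).
decisions = {object_var(b) for _, b, terminal in mechanism_edges if terminal}
# Outgoing terminal edge  => utility node (diamond in the game graph).
utilities = {object_var(a) for a, _, terminal in mechanism_edges if terminal}

print("decisions:", decisions)  # {'D'}
print("utilities:", utilities)  # {'U'}
```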
Together, Algorithms 1 and 2 allow us to discover agents from causal experiments, representing them using CIDs.
Our third algorithm transforms the game graph into a mechanized causal graph, which allows us to translate between the game and mechanized causal graph representations under some additional assumptions:
Algorithm 3 takes as input a game graph and maps it to a mechanized causal graph. A decision maps to an incoming terminal edge, and a utility maps to an outgoing terminal edge.
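And a minimal sketch of the reverse direction, under the simplifying assumption that every utility mechanism gets a terminal edge into every decision mechanism; the paper's construction imposes additional conditions not modeled here.

```python
# Minimal sketch of the reverse mapping in the spirit of Algorithm 3
# (illustrative; the paper's construction has additional conditions).

# Game-graph labels for the mouse example.
decisions = {"D"}
utilities = {"U"}

# Rebuild terminal mechanism edges: each utility's mechanism feeds each
# decision's mechanism with a terminal edge (simplifying assumption).
mechanism_edges = {(u + "~", d + "~", True) for u in utilities for d in decisions}

print(mechanism_edges)  # {('U~', 'D~', True)}
```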
Better safety tools for modeling AI agents
We have proposed the first formal causal definition of agents. Grounded in causal discovery, our key insight is that agents are systems that adapt their behavior in response to changes in how their actions influence the world. Indeed, our Algorithms 1 and 2 describe a precise experimental process that can help assess whether a system contains an agent.
Interest in causal modeling of AI systems is growing rapidly, and our research grounds this modeling in causal discovery experiments. Our paper demonstrates the potential of our approach by improving the safety analysis of several example AI systems and shows that causality is a useful framework for discovering whether an agent is present in a system – a key issue in AGI risk assessment.
Want to learn more? Check out our paper. Feedback and comments are welcome.