In our recent paper, published in Nature Human Behaviour, we present a proof of concept showing that deep reinforcement learning (RL) can be used to find economic policies that a majority of people will vote for in a simple game. The paper thereby addresses a key challenge in AI research: how to train AI systems that align with human values.
Imagine that a group of people decides to pool their funds to make an investment. The investment pays off and a profit is made. How should the proceeds be distributed? One simple strategy is to divide the return equally among the investors. But that can seem unfair, since some people contributed more than others. Alternatively, we could repay each person in proportion to the size of their initial investment. That sounds fair, but what if people started with different amounts of money to begin with? If two people contribute the same sum, but one gives a fraction of their available funds and the other gives all of them, should they receive the same share of the proceeds?
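To make the difference concrete, here is a minimal worked sketch of the three options, with assumed amounts and an assumed 1.5x return chosen purely for illustration:

```python
# A minimal sketch of the three splitting rules above. The amounts and the
# 1.5x return are assumptions made purely for illustration.

endowments    = [10.0, 2.0]   # funds each person had available
contributions = [2.0, 2.0]    # funds each person actually invested
proceeds = sum(contributions) * 1.5

# Equal split: divide the return equally between the investors.
equal = [proceeds / len(contributions)] * len(contributions)

# Proportional split: repay each in proportion to their contribution.
proportional = [proceeds * c / sum(contributions) for c in contributions]

# Relative split: weight by the fraction of available funds each person gave,
# so contributing "all you have" counts for more than giving a small fraction.
shares = [c / e for c, e in zip(contributions, endowments)]
relative = [proceeds * s / sum(shares) for s in shares]

print(equal)         # [3.0, 3.0]
print(proportional)  # [3.0, 3.0]
print(relative)      # [1.0, 5.0]
```

Under the first two rules the two contributors are paid identically, while the relative rule rewards the person who invested everything they had.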
This question of how to redistribute resources in our economies and societies has long been controversial among philosophers, economists, and political scientists. Here, we use deep RL as a testbed to explore ways of addressing this problem.
To tackle this challenge, we created a simple game involving four players. Each game lasted 10 rounds. In every round, each player was allocated funds, with the amount varying between players. Each player then made a choice: keep the funds for themselves or invest them in a common pool. Invested funds were guaranteed to grow, but there was a risk, because the players did not know how the proceeds would be shared out. Instead, they were told that for the first 10 rounds one judge (A) would make the redistribution decisions, and that for the next 10 rounds a different judge (B) would take over. At the end of the game, they voted for either A or B and played one more game with the judge they had chosen. The human players kept the proceeds of this final game, so they were incentivized to report their preferences accurately.
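A hypothetical sketch of this game structure is below, assuming a public-goods-style setup; the endowment values, the 1.5x growth factor, and the players' coin-flip decision rule are illustrative assumptions, not the paper's exact parameters.

```python
import random

def play_game(judge, n_players=4, n_rounds=10, growth=1.5):
    """Play one game of n_rounds and return each player's final balance.
    Assumed parameters for illustration only."""
    balances = [0.0] * n_players
    for _ in range(n_rounds):
        # Each player is allocated funds; the amount varies between players.
        endowments = [random.choice([2.0, 4.0, 10.0]) for _ in range(n_players)]
        # Each player either keeps their funds or invests them in the common pool
        # (here: a simple coin flip stands in for the human decision).
        contributions = [e if random.random() < 0.5 else 0.0 for e in endowments]
        kept = [e - c for e, c in zip(endowments, contributions)]
        pot = sum(contributions) * growth                 # invested funds grow
        payouts = judge(endowments, contributions, pot)   # judge shares out the pot
        balances = [b + k + p for b, k, p in zip(balances, kept, payouts)]
    return balances

# One possible judge: pay out in proportion to each player's contribution.
def proportional_judge(endowments, contributions, pot):
    total = sum(contributions)
    if total == 0:
        return [0.0] * len(contributions)
    return [pot * c / total for c in contributions]

print(play_game(proportional_judge))
```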
In fact, one of the judges was a predefined redistribution policy, and the other was designed by our deep RL agent. To train the agent, we first recorded data from a large number of human groups and trained a neural network to imitate how people played the game. This simulated population could generate unlimited data, allowing us to use data-intensive machine learning methods to train the RL agent to maximize the votes of these “virtual” players. Having done this, we recruited new human players and pitted the AI-designed mechanism head-to-head against well-known baselines, such as a libertarian policy that returns funds to people in proportion to their contributions.
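To show the shape of this two-stage pipeline, here is a toy continuation of the sketch above (it reuses play_game and proportional_judge). It is only a stand-in: the real system cloned human behavior with a neural network and optimized the judge with deep RL, whereas here the “virtual players” are the simple coin-flip contributors inside play_game, and the optimization is a grid search over a single mixing parameter.

```python
# Toy stand-in for the pipeline: virtual players come from play_game above, and
# deep RL is replaced by a grid search over one judge parameter.

def make_judge(alpha):
    """Judge that blends payouts proportional to absolute contribution (alpha=0)
    with payouts proportional to relative contribution (alpha=1)."""
    def judge(endowments, contributions, pot):
        total = sum(contributions)
        if total == 0:
            return [0.0] * len(contributions)
        absolute = [c / total for c in contributions]
        rel = [c / e for c, e in zip(contributions, endowments)]
        relative = [r / sum(rel) for r in rel]
        return [pot * ((1 - alpha) * a + alpha * r)
                for a, r in zip(absolute, relative)]
    return judge

def vote_share(candidate, baseline, n_games=300, n_players=4):
    """Fraction of simulated players who end up better off under the candidate judge."""
    votes = 0
    for _ in range(n_games):
        a = play_game(candidate, n_players)
        b = play_game(baseline, n_players)
        votes += sum(1 for x, y in zip(a, b) if x > y)
    return votes / (n_games * n_players)

# "Train" the judge by picking the parameter that wins the most simulated votes
# against the proportional baseline.
best_alpha = max((i / 10 for i in range(11)),
                 key=lambda a: vote_share(make_judge(a), proportional_judge))
print("best mixing weight:", best_alpha)
```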
When we studied the votes of these new players, we found that the policy designed by deep RL was more popular than the baselines. In fact, when we ran a follow-up experiment in which a fifth human player took on the role of judge and was trained to try to maximize votes, the policy implemented by this “human judge” was still less popular than our agent’s.
AI systems have sometimes been criticized for learning policies that may be inconsistent with human values, and this “value alignment” problem has become a central concern in AI research. One merit of our approach is that the AI learns directly to maximize the stated preferences (or votes) of a group of people. This approach may help ensure that AI systems are less likely to learn policies that are unsafe or unfair. Indeed, when we analyzed the policy the AI had discovered, we found that it incorporated a mixture of ideas previously proposed by human thinkers and experts to address the redistribution problem.
First, the AI chose to redistribute resources to people in proportion to their relative rather than absolute contribution. That is, when redistributing funds, the agent took into account each player’s initial endowment as well as their willingness to contribute. Second, the AI system especially rewarded players whose relative contributions were more generous, perhaps encouraging others to do the same. Importantly, the AI only discovered these rules by learning to maximize human votes. The method thus ensures that humans remain “in the loop” and that the AI produces human-compatible solutions.
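As a rough illustration of the kind of rule described here (not the exact policy the agent learned), one could write the payout weights as a function of relative contribution, with an assumed exponent that over-rewards the most generous relative contributors:

```python
# Rough illustration only -- not the agent's exact learned policy. Payouts scale
# with relative contribution (the fraction of one's own endowment invested);
# the gamma > 1 exponent is an assumed way of over-rewarding the most generous
# relative contributors.

def sketch_learned_rule(endowments, contributions, pot, gamma=2.0):
    rel = [c / e for c, e in zip(contributions, endowments)]
    weights = [r ** gamma for r in rel]
    if sum(weights) == 0:
        return [0.0] * len(weights)
    return [pot * w / sum(weights) for w in weights]
```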
By asking people to vote, we harnessed the principle of majoritarian democracy to decide what people want. Despite its widespread appeal, it is widely acknowledged that democracy comes with the caveat that the preferences of the majority prevail over those of the minority. In our study, we ensured that, as in most societies, the minority consisted of the more generously endowed players. However, more work is needed to understand how to trade off the preferences of majority and minority groups, and to design democratic systems that allow all voices to be heard.