Sunday, April 20, 2025

Building Safer Dialogue Agents

Authors: The Sparrow Team

Training AI to communicate more helpfully, correctly, and harmlessly

In recent years, large language models (LLMs) have achieved success in tasks such as question answering, summarization, and dialogue. Dialogue is a particularly interesting task because it involves flexible and interactive communication. However, LLM-powered dialogue agents can express inaccurate or invented information, use discriminatory language, or encourage unsafe behavior.

To create safer dialogue agents, we need to be able to learn from human feedback. Using reinforcement learning based on input from research participants, we are exploring new methods for training dialogue agents that show promise for a safer system.

In our latest paper, we present Sparrow – a dialogue agent that is useful and reduces the risk of unsafe and inappropriate answers. Our agent is designed to talk with a user, answer questions, and search the internet using Google when it is helpful to look up evidence to inform its responses.

Our new conversational AI model replies on its own to an initial human prompt.
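
As an illustration of the flow described above, here is a minimal, hypothetical sketch of evidence-conditioned answering: the agent may decide to issue a search query, retrieve snippets, and condition its reply on them. The `generate` and `google_search` functions are stand-in stubs assumed for illustration, not Sparrow’s actual interface.

```python
# Hypothetical sketch of search-augmented answering; the stubs below stand in
# for the real language model and web-search components.
from typing import List

def generate(prompt: str) -> str:
    """Stand-in for sampling a continuation from the dialogue language model."""
    return "<model output for: " + prompt[:40] + "...>"

def google_search(query: str, k: int = 3) -> List[str]:
    """Stand-in for retrieving the top-k web snippets for a query."""
    return [f"<snippet {i} for '{query}'>" for i in range(k)]

def answer(user_turn: str, dialogue_history: str) -> str:
    # The agent first decides whether evidence would help (a toy placeholder
    # decision here, delegated to the model itself).
    needs_evidence = generate(
        f"{dialogue_history}\nUser: {user_turn}\nShould I search? (yes/no)"
    )
    if "yes" in needs_evidence.lower():
        # Generate a search query, retrieve snippets, and condition the reply on them.
        query = generate(f"Write a search query for: {user_turn}")
        evidence = "\n".join(google_search(query))
        prompt = f"{dialogue_history}\nEvidence:\n{evidence}\nUser: {user_turn}\nSparrow:"
    else:
        prompt = f"{dialogue_history}\nUser: {user_turn}\nSparrow:"
    return generate(prompt)
```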

Sparrow is a research and proof-of-concept model, designed to train dialogue agents to be more helpful, correct, and harmless. By learning these qualities in a general dialogue setting, Sparrow advances our understanding of how we can train agents to be safer and more useful, and ultimately to help build safer and more useful artificial general intelligence (AGI).

Sparrow declines to answer a potentially harmful question.

How Sparrow works

Training conversational AI is an especially challenging problem because it is difficult to pinpoint what makes a dialogue successful. To address this, we turn to a form of reinforcement learning (RL) based on human feedback, using study participants’ preferences to train a model of how useful a response is.

To obtain this data, we show our participants multiple model responses to the same question and ask them which response they like best. Because we show responses with and without evidence taken from the Internet, this model can also determine when an answer should be supported by evidence.
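
To make that training signal concrete, here is a minimal sketch (not the code used for Sparrow) of how pairwise preferences can train a reward model: a scalar score is assigned to each candidate response, and the loss pushes the preferred response’s score above the rejected one. The `RewardModel` class and the fixed-size embeddings are illustrative assumptions standing in for a language-model backbone.

```python
# Minimal sketch of preference-based reward modelling (Bradley-Terry style loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores an encoded (dialogue context + candidate response); the encoding
    here is a placeholder for features from a pretrained language model."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.head = nn.Linear(embed_dim, 1)  # scalar "how useful is this reply" score

    def forward(self, encoded_dialogue: torch.Tensor) -> torch.Tensor:
        return self.head(encoded_dialogue).squeeze(-1)

def preference_loss(model: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    # Participants saw both candidates (with or without retrieved evidence) and
    # picked one; the loss is -log sigmoid(score_preferred - score_rejected).
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()

# Toy usage with random embeddings standing in for encoded dialogues.
model = RewardModel()
preferred = torch.randn(8, 512)   # batch of encodings for the chosen responses
rejected = torch.randn(8, 512)    # encodings for the responses not chosen
loss = preference_loss(model, preferred, rejected)
loss.backward()
```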

We ask study participants to evaluate and interact with Sparrow, either naturally or adversarially, continually expanding the dataset used to train Sparrow.

But improving usefulness is only part of the story. To make sure the model behaves safely, we must constrain its behavior. So we establish an initial simple set of rules for the model, such as “don’t make threatening statements” and “don’t make hateful or offensive comments.”

We also provide rules around potentially harmful advice and not pretending to be a person. These rules were informed by studying existing work on language harms and consulting with experts. We then ask study participants to talk to our system, with the aim of tricking it into breaking the rules. These conversations then let us train a separate “rule model” that indicates when Sparrow’s behavior breaks any of the rules.
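
The rule model can be thought of as a classifier over (dialogue, rule) pairs. The sketch below is an assumed illustration, not the published architecture: it embeds a rule identifier, combines it with an encoding of the dialogue, and outputs the probability that the rule has been broken; adversarial conversations collected from participants would supply the training labels.

```python
# Hypothetical sketch of a rule model: P(rule violated | dialogue, rule).
import torch
import torch.nn as nn

RULES = [
    "Don't make threatening statements.",
    "Don't make hateful or offensive comments.",
    "Don't give potentially harmful advice.",
    "Don't pretend to have a human identity.",
]

class RuleModel(nn.Module):
    def __init__(self, embed_dim: int = 512, num_rules: int = len(RULES)):
        super().__init__()
        self.rule_embedding = nn.Embedding(num_rules, embed_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1)
        )

    def forward(self, encoded_dialogue: torch.Tensor, rule_id: torch.Tensor) -> torch.Tensor:
        # Concatenate the dialogue encoding with the rule embedding and output
        # the probability that this rule is violated by the latest agent turn.
        features = torch.cat([encoded_dialogue, self.rule_embedding(rule_id)], dim=-1)
        return torch.sigmoid(self.classifier(features)).squeeze(-1)

# Toy usage: score one dialogue (random placeholder encoding) against every rule.
rule_model = RuleModel()
dialogue = torch.randn(1, 512).expand(len(RULES), -1)  # same dialogue, repeated per rule
rule_ids = torch.arange(len(RULES))
violation_probs = rule_model(dialogue, rule_ids)        # one probability per rule
```

During reinforcement learning, such a violation probability could act as a penalty term alongside the usefulness reward; treating it that way is a plausible design choice rather than a detail taken from the paper.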

Towards Better AI and Better Judgments

Verifying the correctness of Sparrow’s answers is difficult even for experts. Instead, we ask our participants to judge whether Sparrow’s answers are plausible and whether the evidence Sparrow provides actually supports the answer. According to our participants, Sparrow gives a plausible answer and supports it with evidence 78% of the time when asked a factual question. This is a big improvement over our baseline models. Still, Sparrow is not immune to mistakes, such as hallucinating facts and giving answers that are sometimes off-topic.

Sparrow also has room for improvement in rule-following. After training, participants were still able to trick it into breaking our rules 8% of the time, but compared to simpler approaches, Sparrow is better at following our rules under adversarial probing. For example, our original dialogue model broke the rules roughly three times more often than Sparrow when our participants tried to trick it.

Sparrow answers a question and a follow-up question using evidence, then follows the “Don’t pretend you have a human identity” rule when asked a personal question (sample from September 9, 2022).

Our goal with Sparrow was to build flexible machinery for enforcing rules and norms in dialogue agents, but the specific rules we use are preliminary. Developing a better and more complete set of rules will require both expert input on many topics (including from policymakers, sociologists, and ethicists) and participation from a diverse range of users and affected groups. We believe our methods will still apply to a more rigorous set of rules.

Sparrow is a significant step forward in understanding how to train dialogue agents to be more useful and safer. However, successful communication between humans and dialogue agents should not only avoid harm but also be aligned with human values for effective and beneficial communication, as discussed in recent work on aligning language models with human values.

We also emphasize that a good agent will still decline to answer questions in contexts where it is appropriate to defer to humans, or where declining may deter harmful behavior. Finally, our initial research focused on an English-speaking agent, and further work is needed to ensure similar results in other languages and cultural contexts.

We hope that, in the future, conversations between humans and machines will enable better judgments of AI behavior, allowing people to align and improve systems that might be too complex to understand without machine help.

Want to explore a conversational path to safer AGI? We are currently hiring research scientists for our Scalable Alignment team.
