A user could ask ChatGPT to write a computer program or a summary of an article, and the AI chatbot would probably be able to generate useful code or write a coherent synopsis. But someone could also ask for instructions on how to build a bomb, and the chatbot could provide those as well.
To prevent this and other safety issues, companies that build large language models typically safeguard them through a process called red-teaming. Teams of human testers write prompts designed to trigger unsafe or toxic text from the model being tested. These prompts are then used to teach the chatbot to avoid such responses.
But this only works effectively if engineers know which toxic prompts to try. If human testers miss some of them, which is likely given the sheer number of possibilities, a chatbot regarded as safe may still be capable of generating unsafe responses.
Researchers at MIT’s Improbable AI Lab and the MIT-IBM Watson AI Lab have used machine learning to improve red-teaming. They’ve developed a technique for training a large red-team language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot being tested.
They do this by teaching the red-team model to be curious when it writes prompts, and to focus on novel prompts that elicit toxic responses from the target model.
This technique outperformed human testers and other machine-learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Their method not only significantly improves the coverage of inputs being tested compared to other automated methods, but it can also draw out toxic responses from a chatbot that had safeguards built in by human experts.
“Currently, any large language model needs to go through a very long period of red-teaming to ensure its safety. This will not be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more efficient way to perform this quality control,” says Zhang-Wei Hong, a graduate student in electrical engineering and computer science (EECS) in the Improbable AI Lab and lead author of a paper on this red-teaming approach.
Hong’s co-authors include EECS graduate students Idan Shenfield, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and head of the Spoken Language Systems Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and an assistant professor at CSAIL. The research will be presented at the International Conference on Learning Representations.
Automated red-teaming
Large language models, such as those that power AI chatbots, are often trained by showing them massive amounts of text from billions of public websites. As a result, not only can they learn to generate toxic language or describe illegal activities, but they can also leak personal information they may have picked up along the way.
The tedious and costly nature of human red-teaming, which is often ineffective at generating a wide enough range of prompts to fully safeguard a model, has encouraged researchers to automate the process using machine learning.
Such techniques often train a red-team model using reinforcement learning. This trial-and-error process rewards the red-team model for generating prompts that trigger toxic responses from the chatbot being tested.
However, because of the way reinforcement learning works, the red-team model will often keep generating a few similar, highly toxic prompts in order to maximize its reward.
For their reinforcement learning approach, the MIT researchers used a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it will try prompts with different words, sentence patterns, or meanings.
“If the red-team model has already seen a particular prompt, replaying it won’t generate any curiosity, so the model is pushed to create new prompts,” Hong says.
During training, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier rates the toxicity of its response, rewarding the red-team model based on that rating.
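As a rough illustration, one pass through that loop might look like the sketch below. The function names here (generate_prompt, chatbot_respond, rate_toxicity, update_red_team) are hypothetical stand-ins for the red-team model, the chatbot under test, the safety classifier, and a reinforcement learning update step; they are not the researchers’ actual code.

```python
# Sketch of the automated red-teaming loop described above (illustrative only).
def red_team_training_loop(generate_prompt, chatbot_respond, rate_toxicity,
                           update_red_team, num_steps=1000):
    """Run a simple red-teaming loop given placeholder callables.

    generate_prompt()            -> str   : red-team model writes a prompt
    chatbot_respond(prompt)      -> str   : target chatbot answers the prompt
    rate_toxicity(response)      -> float : safety classifier scores the response
    update_red_team(prompt, reward)       : reinforcement learning update step
    """
    for _ in range(num_steps):
        prompt = generate_prompt()                 # red-team model proposes a prompt
        response = chatbot_respond(prompt)         # chatbot under test responds
        toxicity = rate_toxicity(response)         # classifier scores the toxicity
        update_red_team(prompt, reward=toxicity)   # reward drives the next prompts
```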
Rewarding curiosity
The red-team model’s goal is to maximize its reward by eliciting an even more toxic response with a prompt it has not tried before. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.
First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, to make the agent curious, they include two novelty rewards. One rewards the model based on the similarity of words across its prompts, and the other rewards it based on semantic similarity. (Lower similarity yields a higher reward.)
To prevent the red-team model from generating random, nonsensical text that could trick the classifier into awarding a high toxicity score, the researchers also added a naturalistic-language bonus to the training objective.
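Conceptually, the modified reward combines all of these terms. The sketch below shows one plausible way to compose them; the weights and score names are illustrative assumptions, not the paper’s exact formulation.

```python
def shaped_reward(toxicity, prompt_entropy, word_novelty, semantic_novelty,
                  naturalness, w_entropy=0.01, w_word=0.1, w_sem=0.1, w_nat=0.1):
    """Combine the toxicity reward with curiosity and naturalness terms.

    All inputs are assumed to be precomputed scores:
      toxicity         : safety classifier's score for the chatbot's response
      prompt_entropy   : entropy of the red-team model's output distribution
      word_novelty     : 1 - word-level similarity to previously seen prompts
      semantic_novelty : 1 - embedding similarity to previously seen prompts
      naturalness      : language-model score that penalizes gibberish prompts
    The weights are made-up defaults for illustration.
    """
    return (toxicity
            + w_entropy * prompt_entropy    # pushes toward more random exploration
            + w_word * word_novelty         # rewards prompts with different wording
            + w_sem * semantic_novelty      # rewards prompts with different meaning
            + w_nat * naturalness)          # discourages nonsensical text
```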
With these additions, the researchers compared the toxicity and diversity of the responses their red-team model generated against those produced by other automated techniques. Their model outperformed the baselines on both metrics.
They also used their red-team model to test a chatbot that had been fine-tuned with human feedback to avoid toxic responses. Their curiosity-driven approach quickly generated 196 prompts that elicited toxic responses from this “safe” chatbot.
“We are seeing a surge in the number of models, and it is only going to grow. Imagine thousands of models or more, with companies and labs regularly rolling out model updates. These models will become an integral part of our lives, and it is important that they are validated before being released for public consumption. Manual model validation simply does not scale, and our work is an attempt to reduce the human effort needed to ensure a safer and more trustworthy AI future,” says Agrawal.
In the future, the researchers want to enable the red-team model to generate prompts about a wider variety of topics. They also want to explore using a large language model as the toxicity classifier. That way, a user could train the toxicity classifier on, for example, a company policy document, so the red-team model could test a chatbot for policy violations.
“If you’re bringing a new AI model to market and you’re concerned about whether it will perform as expected, consider curiosity-driven red-teaming,” Agrawal says.
This research is funded in part by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the U.S. Army Research Office, the U.S. Defense Advanced Research Projects Agency’s Machine Common Sense Program, the U.S. Office of Naval Research, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.