Thursday, April 23, 2026

Protecting people against harmful manipulation


As AI models become better at driving natural conversations, we need to examine how these interactions impact people and society.

Drawing on a wide range of scientific research, today we publish new findings on a potential misuse of AI known as harmful manipulation: its ability to change people's thoughts and behaviors in negative and deceptive ways. With this latest study, we have created the first empirically validated toolkit for measuring this type of AI manipulation in real-world settings, which we hope will help protect people and advance the field as a whole. We are publicly releasing all materials needed to conduct research with human participants using the same methodology.

Why harmful manipulation matters

Consider two scenarios: One AI model provides facts to help you make informed health care decisions that improve your well-being. Another AI model uses fear to pressure you into rash decisions that harm your health. The first educates and helps; the second deceives and harms.

These scenarios highlight the difference between two types of persuasion in human-AI interactions (a distinction also drawn in previous research):

  • Beneficial (rational) persuasion: Using facts and evidence to help people make choices consistent with their own interests
  • Harmful manipulation: Exploiting emotional and cognitive vulnerabilities to trick people into making harmful choices

Our latest work helps us and the broader AI community better understand the risk of AI developing harmful manipulation capabilities, and build a scalable evaluation framework for this complex area. To do this effectively, we simulated misuse in high-stakes settings, explicitly prompting the AI to attempt to manipulate people's beliefs and behaviors on key issues in negative ways.

Developing novel assessments for a complex challenge

Measuring the outcomes of harmful AI manipulation

Testing for harmful manipulation is inherently difficult because it involves measuring subtle changes in how people think and act, which vary greatly across topics, cultures, and contexts.

This motivated our latest work, which comprised nine studies with over 10,000 participants from the UK, US, and India. We focused on high-stakes areas such as finance, where we used simulated investment scenarios to test whether AI could influence human behavior in complex decision-making environments, and health, where we tested whether AI could influence which dietary supplements people preferred. Interestingly, AI was the least effective at harmfully manipulating participants on health-related issues.

Our findings show that success in one domain does not translate to success in another, supporting our targeted approach of testing harmful manipulation in the specific high-stakes settings where AI may be misused.

How does AI manipulate?

In addition to tracking effectiveness (whether the AI actually changes people's beliefs and behavior), we also measured its propensity (how often it uses manipulative tactics at all). We tested propensity in two scenarios: when we explicitly instructed the model to manipulate, and when we did not.

As detailed in our research, we counted manipulative tactics in the experiment transcripts, confirming that the AI models were most manipulative when explicitly instructed to be.

Our results also suggest that certain manipulative tactics may be more likely to result in harmful outcomes, although further research is needed to understand these mechanisms in detail.

By measuring both effectiveness and propensity, we can better understand how AI manipulation works and develop more targeted countermeasures.
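As an illustration of the propensity measurement described above, the sketch below counts how often labeled manipulative tactics appear across the AI's turns in a transcript. The tactic labels, data format, and function names here are hypothetical, not the taxonomy or code used in the study; this is a minimal sketch of the counting idea only.

```python
from collections import Counter

# Hypothetical tactic labels; the study's actual taxonomy may differ.
TACTICS = {"fear_appeal", "false_urgency", "guilt_tripping"}

def tactic_counts(turn_annotations):
    """Count annotated manipulative tactics across a transcript,
    given one list of tactic labels per AI turn."""
    counts = Counter()
    for labels in turn_annotations:
        counts.update(label for label in labels if label in TACTICS)
    return counts

def propensity(turn_annotations):
    """Fraction of AI turns containing at least one manipulative tactic."""
    if not turn_annotations:
        return 0.0
    flagged = sum(1 for labels in turn_annotations
                  if any(label in TACTICS for label in labels))
    return flagged / len(turn_annotations)

# Example: three AI turns, two of which use at least one tactic.
annotations = [["fear_appeal"], [], ["false_urgency", "fear_appeal"]]
print(tactic_counts(annotations))  # Counter({'fear_appeal': 2, 'false_urgency': 1})
print(propensity(annotations))     # 0.666...
```

Separating per-tactic counts from the per-turn propensity rate mirrors the distinction in the text: the former shows which tactics a model reaches for, the latter how often it manipulates at all.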
