An early warning system for novel AI risks

Responsibility and safety

Published
Sunday, April 20, 2025

Author
Toby Shevlane

New research proposes a framework for evaluating general-purpose models against novel threats

To pioneer responsibly at the cutting edge of artificial intelligence (AI) research, we must identify new capabilities and new risks in our AI systems as early as possible.

AI researchers already use a range of evaluation benchmarks to identify unwanted behaviors in AI systems, such as AI systems making misleading statements, biased decisions, or repeating copyrighted content. Now, as the AI community builds and deploys increasingly powerful AI, we must expand the evaluation portfolio to include the possibility of extreme risks from general-purpose AI models that have strong skills in manipulation, deception, cyber-offense, or other dangerous capabilities.

In our latest paper, we introduce a framework for evaluating these novel threats, co-authored with colleagues from the University of Cambridge, the University of Oxford, the University of Toronto, the Université de Montréal, OpenAI, Anthropic, the Alignment Research Center, the Centre for Long-Term Resilience, and the Centre for the Governance of AI.

Model safety evaluations, including extreme risk evaluations, will be a key part of the safe development and deployment of AI.

Overview of our proposed approach: to assess extreme risks from new, general-purpose AI systems, developers must evaluate for dangerous capabilities and alignment (see below). Identifying the risks early opens up opportunities to be more responsible when training new AI systems, deploying those AI systems, transparently describing their risks, and applying appropriate cybersecurity standards.

Evaluating for extreme risks

General-purpose models typically learn their capabilities and behaviors during training. However, existing methods for steering the learning process are imperfect. For example, previous research at Google DeepMind explored how AI systems can learn to pursue undesired goals even when we correctly reward them for good behavior.

Responsible AI developers must look ahead and anticipate possible future developments and novel risks. With continued progress, future general-purpose models may learn a variety of dangerous capabilities by default. For example, it is plausible (though uncertain) that future AI systems will be able to conduct offensive cyber operations, skillfully deceive humans in dialogue, manipulate humans into carrying out harmful actions, design or acquire weapons (e.g. biological, chemical), fine-tune and operate other high-risk AI systems on cloud computing platforms, or assist humans with any of these tasks.

People with malicious intentions who access such models could misuse their capabilities. Or, because of failures of alignment, AI models could take harmful actions even without anybody intending it.

Model evaluation helps us identify these risks in advance. Under our framework, AI developers would use model evaluation to uncover:

  1. The extent to which a model has certain “dangerous capabilities” that could be used to threaten security, exert influence, or evade oversight.
  2. The extent to which the model is prone to applying its capabilities to cause harm (i.e. the model’s alignment). Alignment evaluations should confirm that the model behaves as intended even across a very wide range of scenarios and, where possible, should examine the model’s internal workings.

The results of these evaluations will help AI developers understand whether the ingredients sufficient for extreme risk are present. The highest-risk cases will involve multiple dangerous capabilities combined. The AI system does not need to provide all of the ingredients itself, as shown in the diagram below:

Ingredients of extreme risk: sometimes specific capabilities could be outsourced, either to humans (e.g. users or crowdworkers) or to other AI systems. These capabilities must be applied for harm, whether through misuse or failures of alignment (or a mixture of both).

Rule of thumb: the AI community should treat an AI system as highly dangerous if it has a capability profile sufficient to cause extreme harm, assuming it is misused or misaligned. To deploy such a system in the real world, an AI developer would need to demonstrate an unusually high standard of safety.
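To make the structure of these evaluations more concrete, here is a minimal sketch in Python of how a developer’s internal tooling might record the two kinds of evaluation results and apply the rule of thumb above. All class names, field names, and thresholds (for example `EvalResult` and `capability_threshold`) are hypothetical illustrations, not part of the framework or of any real evaluation suite.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    """One evaluation outcome; higher scores mean more concerning results."""
    name: str     # e.g. "offensive_cyber", "deception", "weapons_acquisition"
    score: float  # normalized to 0.0 (benign) .. 1.0 (highly concerning)

@dataclass
class ModelEvaluation:
    dangerous_capabilities: list[EvalResult]  # what the model *could* do
    alignment: list[EvalResult]               # how prone it is to applying those capabilities for harm

def is_highly_dangerous(report: ModelEvaluation, capability_threshold: float = 0.7) -> bool:
    """Rule of thumb: flag the model if its capability profile would be sufficient
    to cause extreme harm, *assuming* it is misused or misaligned. The flag is
    driven by capabilities alone; alignment results indicate how likely harm is
    without deliberate misuse, but they do not relax this flag."""
    return any(r.score >= capability_threshold for r in report.dangerous_capabilities)

# Example usage with made-up scores:
report = ModelEvaluation(
    dangerous_capabilities=[EvalResult("offensive_cyber", 0.8), EvalResult("deception", 0.4)],
    alignment=[EvalResult("goal_misgeneralisation", 0.2)],
)
print(is_highly_dangerous(report))  # True: one dangerous capability crosses the threshold
```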

Model evaluation as critical governance infrastructure

If we have better tools to identify which models are risky, companies and regulators will be better able to ensure:

  1. Responsible training: Responsible decisions are made about whether and how to train a new model that shows early signs of risk.
  2. Responsible deployment: Responsible decisions are made about whether, when, and how to deploy potentially risky models.
  3. Transparency: Useful and actionable information is reported to stakeholders, to help them prepare for or mitigate potential risks.
  4. Appropriate security: Strong information security controls and systems are applied to models that might pose extreme risks.

We have developed a blueprint for how model evaluations for extreme risks should feed into important decisions around training and deploying a highly capable, general-purpose model. The developer conducts evaluations throughout, and grants structured access to the model to external safety researchers and model auditors so they can conduct additional evaluations. The evaluation results can then inform risk assessments before model training and deployment.

A blueprint for embedding model evaluations for extreme risks into important decision-making processes throughout model training and deployment.
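Reusing the hypothetical `ModelEvaluation` and `is_highly_dangerous` helpers from the earlier sketch, the workflow described above might be wired together roughly as follows. The gate functions, the `cleared` field, and the report format are invented for illustration; they stand in for whatever internal review process a developer actually uses.

```python
def pre_training_gate(prior_eval_reports: list[ModelEvaluation]) -> bool:
    """Responsible training: decide whether to start (or continue) training,
    based on early warning signs from evaluations of earlier or smaller models."""
    return not any(is_highly_dangerous(r) for r in prior_eval_reports)

def pre_deployment_gate(report: ModelEvaluation, external_audits: list[dict]) -> bool:
    """Responsible deployment: combine the developer's own evaluation results with
    findings from external safety researchers and model auditors who were granted
    structured access to the model."""
    internally_acceptable = not is_highly_dangerous(report)
    externally_cleared = all(a.get("cleared", False) for a in external_audits)
    return internally_acceptable and externally_cleared

def transparency_report(report: ModelEvaluation) -> dict:
    """Transparency: summarise which evaluations were run and whether any flags
    were raised, without reproducing details that could themselves cause harm."""
    return {
        "dangerous_capability_evals": [r.name for r in report.dangerous_capabilities],
        "alignment_evals": [r.name for r in report.alignment],
        "flagged_highly_dangerous": is_highly_dangerous(report),
    }
```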

Looking to the future

Important early work on model evaluations for extreme risks is already underway at Google DeepMind and elsewhere. But much more progress – both technical and institutional – is needed to build an evaluation process that catches all possible risks and helps guard against future, emerging challenges.

Model evaluation is not a panacea; some risks may, for example, slip through the net because they depend too heavily on factors external to the model, such as complex social, political, and economic forces in society. Model evaluation must be combined with other risk assessment tools and a broader commitment to safety across industry, government, and civil society.

Google’s recent blog post on responsible AI states that “individual practices, shared industry standards, and sound government policies would be essential to getting AI right.” We hope that many others working in AI, and the sectors affected by this technology, will come together to create approaches and standards for safely developing and deploying AI for the benefit of all.

We believe that having processes in place to track the emergence of risky properties in models and respond appropriately to troubling results is a key element of being a responsible developer at the frontier of AI.
