How we protect internal systems against increasingly capable and imperfectly aligned AI
AI agents are changing our relationship with technology. By autonomously performing sophisticated tasks, from cyber defense to scientific discovery and product development, these systems open a new era of productivity. In the US alone, AI agents could generate $2.9 trillion in economic value by 2030.
As agents become more capable, they also require more sophisticated security. That’s why we developed our AI control roadmap: a framework for building and managing the advanced AI we deploy at Google. This “defense in depth” approach, which could serve as a model for the broader industry, goes beyond classic model alignment by adding a key layer of system-level security that provides confidence even when alignment is imperfect.
Understanding AI control
Our approach to security starts with a strong foundation, including classic security features such as sandboxing, endpoint security, and prompt injection resistance. In addition, the AI control roadmap relies on model alignment, i.e. training the AI to be inherently safe and helpful, as its primary defense. It then adds a further layer of security by treating internal agents as potentially misaligned, providing confidence even if alignment is imperfect.
Think of it like a dual-control driving instructor. The instructor trusts the student but remains ready to take over the wheel or hit the brake if a mistake occurs. Similarly, our AI control system grants permissions to AI agents based on their verified behavior, allowing us to build trust through controlled, incremental access.
Our roadmap starts from a simple but cautious question: what happens if a highly capable AI agent is misaligned with our goals and begins to act in unexpected ways? To address this, our plan focuses on three key areas.
First, guided by cybersecurity best practices, we have developed a new threat modeling framework for AI. The roadmap treats untrusted AI agents as potential “insider threats”, much like a company would treat a potential rogue employee who already has access to the office. Building on the industry-standard MITRE ATT&CK framework, a globally accessible knowledge base of adversary tactics and techniques based on real-world observations, we break potential attacks down into smaller tactics and techniques. This allows us to methodically track risk, detect bad behavior before it causes damage, and safely conduct drills to see whether our defenses actually hold up.
Second, we are developing a plan to implement AI control mitigation measures to address these threats.
- Problem identification (detection): Monitoring is the basis of AI control. We employ other trusted AI systems as “supervisors” that constantly check an agent’s reasoning, actions and plans to ensure they are on track.
- Problem management (prevention and response): If a supervisor sees an agent trying to do something harmful, it can step in to block the action before damage occurs.
Finally, to ensure our systems are effective, we continually measure performance against three critical metrics: coverage (the portion of traffic monitored), recall (the portion of abnormal behavior detected), and response time.
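To make the detect-then-block pattern concrete, here is a minimal Python sketch. It is an illustration only, not the system described in this post: `ProposedAction`, `keyword_supervisor`, `gate`, and the 0.5 threshold are all hypothetical names and values, and the keyword check is a toy stand-in for a trusted supervisor model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    """An action an agent wants to take, with its stated reasoning (hypothetical schema)."""
    agent_id: str
    reasoning: str
    command: str


# A supervisor maps a proposed action to a suspicion score in [0, 1].
Supervisor = Callable[[ProposedAction], float]


def keyword_supervisor(action: ProposedAction) -> float:
    """Toy stand-in for a trusted monitor model: flags a few hard-coded phrases."""
    risky_markers = ("rm -rf", "disable logging", "exfiltrate")
    text = f"{action.reasoning} {action.command}".lower()
    return 1.0 if any(marker in text for marker in risky_markers) else 0.0


def gate(action: ProposedAction, supervisor: Supervisor, threshold: float = 0.5) -> bool:
    """Return True if the action may proceed, False if it is blocked before execution."""
    return supervisor(action) < threshold
```

In this sketch the gate runs before the action executes, so a flagged action never reaches the target system; a real supervisor would score the action with a trusted model rather than with keywords.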
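As a rough illustration of how these three metrics could be computed from logged events, consider the Python sketch below. The `Event` schema and the `control_metrics` function are assumptions made for this example, and it presumes that ground-truth labels for abnormal behavior are available, for instance from a controlled drill.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Event:
    """One logged agent action (hypothetical schema for illustration)."""
    monitored: bool                    # did a supervisor observe this event?
    anomalous: bool                    # ground-truth label, e.g. from a drill
    flagged: bool                      # did the supervisor flag it?
    response_seconds: Optional[float]  # time from event to response, if any


def control_metrics(events: list[Event]) -> dict:
    """Compute the monitored share, the detected share of anomalies, and median response time."""
    total = len(events)
    monitored = sum(e.monitored for e in events)
    anomalous = [e for e in events if e.anomalous]
    # An anomaly counts as detected only if it was both observed and flagged.
    detected = [e for e in anomalous if e.monitored and e.flagged]
    times = [e.response_seconds for e in detected if e.response_seconds is not None]
    return {
        "coverage": monitored / total if total else 0.0,
        "recall": len(detected) / len(anomalous) if anomalous else 0.0,
        "median_response_seconds": median(times) if times else None,
    }
```

Using the median for response time is one design choice among several; a tail percentile may matter more when the slowest responses carry the most risk.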
Scaling security as AI gets smarter
As AI models continue to evolve, our defenses must also become stronger. In our AI control roadmap, we map security protocols to measurable milestones in AI capabilities on two key fronts:
- Ability to avoid detection: Today we monitor the model’s observable chain of thought. However, as models learn to hide their reasoning, either because they know they are being monitored or because they reason without producing observable text, merely reading their verbalized reasoning will not be enough. We will need to analyze their behavior and potentially inspect the inner workings of the model.
- Capacity for potential harm: For low-risk and reversible activities, we can rely on reviewing transcripts with a delay and correcting problems quickly after they occur (asynchronous response). For particularly risky activities, such as major cyber attacks, we need to move to real-time prevention, blocking malicious activities before they can be carried out (synchronous response). In other words, our response strategy must scale based on the severity of the threat.
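The idea of scaling the response to the severity of the threat can be sketched as a simple routing rule. The Python below is a hypothetical illustration: the two-level `Severity` scale, the `reversible` flag, and the function name are assumptions for this example, not details of the roadmap.

```python
from enum import Enum


class Severity(Enum):
    LOW = 1
    HIGH = 2


class ResponseMode(Enum):
    ASYNC_REVIEW = "review the transcript after the fact and correct problems"
    SYNC_BLOCK = "hold the action until a monitor approves it"


def choose_response_mode(severity: Severity, reversible: bool) -> ResponseMode:
    """Pick asynchronous review only for low-risk, reversible actions."""
    if severity is Severity.LOW and reversible:
        return ResponseMode.ASYNC_REVIEW
    # Anything high-risk or irreversible is checked in real time, before it runs.
    return ResponseMode.SYNC_BLOCK
```

The rule defaults to the stricter mode: an action qualifies for delayed review only when it is both low-risk and reversible, which trades added latency for safety on everything else.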
