Wednesday, March 11, 2026

Google ‘Watch & Learn’ platform removes data bottleneck for desktop agent training


A new framework developed by researchers at Google Cloud and DeepMind aims to address one of the key challenges in developing computer-use agents (CUAs): collecting high-quality training examples at scale.

The framework, called Watch & Learn (W&L), generates training data without human annotations by automatically extracting demonstrations from raw videos.

Their experiments show that data generated with W&L can be used to train or fine-tune existing computer-use and foundation models, improving their performance on computer-use tasks. Just as importantly, the same approach can be used to create in-context learning (ICL) examples for desktop agents, enabling companies to build CUAs for bespoke internal tasks without the expense of training specialized models.

CUA data bottleneck

The web is rich with video tutorials and screencasts that document intricate application workflows. These videos are a gold mine that can give computer-use agents domain knowledge and instructions for performing tasks through user interface interactions.

However, before they can be used to train CUAs, these videos must be converted into annotated trajectories (i.e., sequences of task descriptions, screenshots, and actions), a process that is prohibitively expensive and time-consuming when done manually.

Existing approaches to the data bottleneck involve annotating videos with multimodal language models, which usually yields low-precision and incorrect examples. Another approach uses self-play agents that autonomously explore user interfaces to collect trajectories. However, such techniques tend to produce simple examples that fail to generalize to unpredictable real-world situations.

As the researchers note in their paper: “Generally speaking, these approaches either rely on fragile heuristics, are expensive because they rely on explorations in real-world environments, or generate low-complexity demonstrations that are inconsistent with human intentions.”

Watch & Learn

The Watch & Learn framework attempts to address the challenges of creating CUA demonstrations by reframing the problem.

Rather than directly generating trajectories or relying on intricate multi-step pipelines, the researchers frame the problem as an "inverse dynamics objective": given two consecutive observations, predict the intermediate action that caused the transition.

According to the researchers, this formulation is “easier to learn, avoids manually developed heuristics, and effectively generalizes across applications.”
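The inverse dynamics objective reduces to ordinary supervised learning: from any logged rollout, pairs of consecutive observations become the input and the intermediate action becomes the target. A minimal sketch of that data preparation step, with illustrative names that are my assumption rather than the paper's code:

```python
# Hypothetical sketch: turning a logged rollout into supervised
# (observation pair -> action) examples, which is all the
# inverse-dynamics objective requires. Names are illustrative.

def make_idm_examples(episode):
    """episode: list of (observation, action) pairs from an agent rollout.

    Returns training examples of the form ((obs_t, obs_t+1), action_t):
    two consecutive observations as input, the intermediate action
    that caused the transition as the prediction target.
    """
    examples = []
    for (obs_t, act_t), (obs_next, _) in zip(episode, episode[1:]):
        examples.append(((obs_t, obs_next), act_t))
    return examples
```

An IDM trained on such pairs never needs a human to label the action; the action is already known from the rollout log.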

The W&L framework can be broken down into three key steps: inverse dynamics model (IDM) training, raw video retrieval, and CUA agent training.

In the first phase, the researchers used agents interacting with live web pages to create a large set of 500,000 state transitions (two consecutive observations and the action that caused the transition). They then used this data, along with 132,000 human-annotated transitions from existing open datasets, to train an inverse dynamics model (IDM) that takes two consecutive observations and predicts the action behind the transition. Their trained IDM, a small transformer model, outperformed off-the-shelf baseline models at predicting transition actions.

The researchers then designed a pipeline that takes videos from platforms such as YouTube and runs them through the IDM to generate high-quality trajectories. The IDM takes successive video frames and determines the actions (scrolling, clicking) that caused the changes in the environment, which are then packaged into annotated trajectories. Using this method, they generated 53,125 trajectories with highly accurate action labels.
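The labeling pipeline can be sketched as a loop over consecutive frame pairs: the IDM predicts the action behind each transition, and the results are packaged into an annotated trajectory. This is a simplified illustration under my own naming assumptions, not the paper's implementation:

```python
# Hypothetical sketch of the video-labeling pipeline. `idm` stands in
# for the trained inverse dynamics model; field names are assumptions.

def frames_to_trajectory(task_description, frames, idm):
    """Pair consecutive video frames, let the IDM predict the action
    behind each transition, and package the result as an annotated
    trajectory (task description + observation/action steps)."""
    steps = []
    for before, after in zip(frames, frames[1:]):
        action = idm(before, after)  # e.g. "scroll(down)" or "click(x, y)"
        steps.append({"observation": before, "action": action})
    return {"task": task_description, "steps": steps}
```

Run at scale over tutorial videos, this loop is what converts unlabeled screencasts into the kind of annotated trajectories that CUA training normally requires humans to produce.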

These examples can be used to train effective models for specific computer-use tasks. However, the researchers also found that IDM-extracted trajectories can serve as in-context learning examples, improving CUA performance on novel tasks at inference time. For ICL, they use Gemini 2.5 Flash to add reasoning annotations to the observation/action pairs in the trajectories, which can then be inserted into the CUA agent's prompt (typically 3-5 examples) at inference time.
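Assembling those trajectories into a prompt is straightforward: select a handful of examples and format each as a task with observation, reasoning, and action steps. A minimal sketch, assuming my own field names and formatting (the paper does not specify a prompt template):

```python
# Hypothetical sketch: formatting IDM-extracted trajectories (with
# reasoning annotations) as in-context examples for a CUA prompt.
# The template and field names are assumptions, not the paper's spec.

def build_icl_prompt(task, trajectories, k=3):
    """Format up to k annotated trajectories as in-context examples,
    followed by the task the agent should now complete."""
    blocks = []
    for traj in trajectories[:k]:  # the researchers use roughly 3-5 examples
        lines = [f"Example task: {traj['task']}"]
        for step in traj["steps"]:
            lines.append(f"Observation: {step['observation']}")
            if "reasoning" in step:  # annotation added by an LLM for ICL
                lines.append(f"Reasoning: {step['reasoning']}")
            lines.append(f"Action: {step['action']}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + f"\n\nNow complete this task: {task}"
```

Because the examples live in the prompt rather than in model weights, the same pool of trajectories can steer an open-source model or a proprietary API model without any fine-tuning.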

"This dual role (training and in-context guidance) enables flexible integration with both open-source models and general-purpose agents," the researchers write.

W&L in action

To test the usefulness of W&L, the researchers conducted a series of experiments with closed and open models on the OSWorld benchmark, which evaluates agents in real desktop and operating-system environments across a variety of tasks, including productivity, programming, and design.

For fine-tuning, they used a set of 53,000 trajectories to train two open models: UI-TARS-1.5, a powerful open-source vision-language-action model designed specifically for computer use, and Qwen 2.5-VL, an open-weight multimodal LLM.

For in-context learning tests, they supplied W&L examples to general-purpose multimodal models such as Gemini 2.5 Flash, OpenAI o3, and Claude Sonnet 4.

W&L produced improvements on OSWorld across all model categories: up to 3 points for ICL with general-purpose models and up to 11 points for fine-tuned open-source models.

More importantly, these gains were achieved without any manual annotations, "demonstrating that web-scale human workflows can serve as a practical and scalable foundation for advancing CUAs toward real-world deployment," the researchers write.

This could have important implications for real-world applications, enabling enterprises to transform existing corpora of videos and meeting recordings into CUA training data. It also makes it easier to generate new training trajectories: record videos of yourself performing various tasks and have the IDM annotate them. And as frontier models continue to improve and become cheaper, you can expect to get more out of existing data as the field continues to evolve.
