Friday, April 18, 2025

Active Offline Policy Selection

Reinforcement learning (RL) has made huge strides in recent years in solving real-world problems, and offline RL has made it even more practical. Instead of interacting directly with the environment, we can now train many policies from a single, pre-recorded dataset. However, this data-efficiency advantage of offline RL is lost when it comes to evaluating the trained policies.

For example, when training robotic manipulation policies, the robot's time and resources are usually limited, and training multiple policies with offline RL on a single dataset gives us a vast data-efficiency advantage compared to online RL. But evaluating each policy is an expensive process that requires interacting with the robot thousands of times. Once we factor in the choice of algorithm, hyperparameters, and number of training steps, the problem quickly becomes intractable.

To make RL more useful in real-world applications such as robotics, we propose an intelligent evaluation procedure for selecting the policy to deploy, called active offline policy selection (A-OPS). In A-OPS, we make use of the pre-recorded dataset and allow a limited number of interactions with the real environment to boost the quality of the selection.

To minimize interactions with the real environment, we implement three key features:

  1. Off-policy policy evaluation, such as fitted Q-evaluation (FQE), lets us form an initial estimate of each policy's performance from the offline dataset alone. It correlates well with ground-truth performance in many environments, including real-world robotics, where it is applied here for the first time. (A minimal FQE sketch follows this list.)

  2. The policy returns are modeled jointly using a Gaussian process, where the observations include the FQE scores and a small number of newly collected episodic returns from the robot. After evaluating one policy, we gain knowledge about all policies, because their return distributions are correlated through a kernel between pairs of policies. The kernel assumes that if policies take similar actions, such as moving the robot's gripper in a similar direction, they tend to have similar returns. (See the Gaussian-process sketch below.)

  3. To use the data more efficiently, we apply Bayesian optimization and prioritize the most promising policies for evaluation next, namely those with high predicted performance and high variance. (See the selection-loop sketch below.)
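The block below is a minimal sketch of fitted Q-evaluation under simplifying assumptions: a dataset of (state, action, reward, next state, done) transitions with vector-valued states and actions, a deterministic target policy callable as `policy(state)`, and an off-the-shelf regressor standing in for the neural Q-network used in practice. All names are illustrative rather than taken from the A-OPS release.

```python
# Minimal fitted Q-evaluation (FQE) sketch; names and defaults are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fqe_value_estimate(transitions, initial_states, policy, gamma=0.99, iters=20):
    """Estimate the value of `policy` from offline transitions only."""
    s, a, r, s_next, done = (np.asarray(x) for x in zip(*transitions))
    sa = np.concatenate([s, a], axis=1)          # regressor inputs: (state, action)
    q = None
    for _ in range(iters):
        if q is None:
            target = r                           # first pass: immediate reward only
        else:
            # Bootstrap target: r + gamma * Q(s', pi(s')), zeroed at terminal states.
            next_sa = np.concatenate(
                [s_next, np.array([policy(x) for x in s_next])], axis=1)
            target = r + gamma * (1.0 - done) * q.predict(next_sa)
        q = GradientBoostingRegressor().fit(sa, target)
    # Policy value = average Q at initial states under the policy's own actions.
    init_sa = np.concatenate(
        [initial_states, np.array([policy(x) for x in initial_states])], axis=1)
    return float(q.predict(init_sa).mean())
```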
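Next is a sketch of how policy returns can be modeled jointly with a Gaussian process. The kernel compares policies through the actions they take on a set of probe states, and FQE scores and real episodic returns both enter as noisy observations of the same latent return, with FQE assigned a larger noise. The exact kernel and noise model in A-OPS differ in detail; this is an illustration, not the released implementation.

```python
# Joint Gaussian-process model over policy returns (illustrative sketch).
import numpy as np

def policy_kernel(policies, probe_states, length_scale=1.0):
    """RBF kernel on mean action distance: similar actions -> similar returns."""
    acts = np.stack([[p(s) for s in probe_states] for p in policies])  # (P, S, A)
    d2 = ((acts[:, None] - acts[None, :]) ** 2).mean(axis=(2, 3))      # (P, P)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gp_posterior(K, obs_idx, obs_vals, obs_noise):
    """Posterior mean/variance of every policy's return given noisy observations.

    K:         (P, P) kernel matrix over policies
    obs_idx:   indices of observed policies (repeats allowed)
    obs_vals:  observed values (FQE scores or real episodic returns)
    obs_noise: per-observation noise variances (larger for FQE scores)
    """
    K_oo = K[np.ix_(obs_idx, obs_idx)] + np.diag(obs_noise)
    K_po = K[:, obs_idx]                                   # cross-covariances (P, O)
    alpha = np.linalg.solve(K_oo, np.asarray(obs_vals, dtype=float))
    mean = K_po @ alpha
    v = np.linalg.solve(K_oo, K_po.T)                      # (O, P)
    var = np.diag(K) - np.sum(K_po.T * v, axis=0)
    return mean, np.maximum(var, 1e-12)
```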
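Finally, a sketch of the resulting selection loop, reusing `gp_posterior` from the block above. A UCB-style rule (high posterior mean plus high posterior standard deviation) stands in here for the paper's acquisition strategy, and `run_episode` is an assumed helper that executes one episode on the real system and returns its return.

```python
# Active selection loop: start from FQE scores, then spend a small budget of
# real episodes on the most promising policies (illustrative sketch).
import numpy as np

def a_ops_loop(policies, K, fqe_scores, run_episode, budget=20, beta=2.0,
               fqe_noise=1.0, return_noise=0.1):
    obs_idx = list(range(len(policies)))        # every policy starts with an FQE score
    obs_vals = list(fqe_scores)
    obs_noise = [fqe_noise] * len(policies)     # FQE is noisier than real rollouts
    for _ in range(budget):
        mean, var = gp_posterior(K, obs_idx, obs_vals, obs_noise)
        pick = int(np.argmax(mean + beta * np.sqrt(var)))   # UCB-style acquisition
        obs_idx.append(pick)
        obs_vals.append(run_episode(policies[pick]))        # one real episode
        obs_noise.append(return_noise)
    mean, _ = gp_posterior(K, obs_idx, obs_vals, obs_noise)
    return int(np.argmax(mean))                 # recommend the highest-mean policy
```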

We demonstrate this procedure in multiple environments across several domains: dm-control, Atari, and simulated and real robotics. A-OPS reduces regret quickly, and with a moderate number of policy evaluations it identifies the best policy.

Our results suggest that effective offline policy selection is possible with only a small number of interactions with the environment, by leveraging the offline data, a special policy kernel, and Bayesian optimization. The code for A-OPS is open-source and available on GitHub, with an example dataset to try out.
