Friday, April 18, 2025

Active Offline Policy Selection

Reinforcement learning (RL) has made huge strides in recent years in solving real-world problems, and offline RL has made it even more practical. Instead of interacting directly with the environment, we can now train many policies from a single, pre-recorded dataset. However, this data-efficiency advantage of offline RL is lost when it comes to evaluating the trained policies.

For example, when training robotic manipulation policies, the robot's time and resources are usually limited, and training multiple policies with offline RL on a single dataset gives us a vast data-efficiency advantage compared to online RL. But evaluating each policy is an expensive process that requires interacting with the robot thousands of times. Once we factor in the choice of algorithm, hyperparameters, and number of training steps, the problem quickly becomes intractable.

To make RL more useful in real-world applications such as robotics, we propose an intelligent evaluation procedure for selecting the policy to deploy, called active offline policy selection (A-OPS). In A-OPS, we make use of the pre-recorded dataset and allow a limited number of interactions with the real environment to boost the quality of the selection.

To minimize interactions with the real environment, we implement three key features:

  1. Off-policy policy evaluation, such as fitted Q-evaluation (FQE), lets us form an initial estimate of each policy's performance from the offline dataset alone. It correlates well with ground-truth performance in many environments, including real-world robotics, where it is applied here for the first time. (A minimal FQE sketch follows this list.)

  2. The policy returns are modeled jointly using a Gaussian process, where the observations include the FQE scores and a small number of newly collected episodic returns from the robot. After evaluating one policy, we gain knowledge about all policies, because their return distributions are correlated through a kernel between pairs of policies. The kernel assumes that if policies take similar actions, such as moving the robot's gripper in a similar direction, they tend to have similar returns. (See the Gaussian-process sketch below.)

  3. To use the data more efficiently, we apply Bayesian optimization and prioritize the most promising policies for evaluation next, namely those with high predicted performance and high variance. (See the selection-loop sketch below.)
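The block below is a minimal sketch of fitted Q-evaluation under simplifying assumptions: a dataset of (state, action, reward, next state, done) transitions with vector-valued states and actions, a deterministic target policy callable as `policy(state)`, and an off-the-shelf regressor standing in for the neural Q-network used in practice. All names are illustrative rather than taken from the A-OPS release.

```python
# Minimal fitted Q-evaluation (FQE) sketch; names and defaults are illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fqe_value_estimate(transitions, initial_states, policy, gamma=0.99, iters=20):
    """Estimate the value of `policy` from offline transitions only."""
    s, a, r, s_next, done = (np.asarray(x) for x in zip(*transitions))
    sa = np.concatenate([s, a], axis=1)          # regressor inputs: (state, action)
    q = None
    for _ in range(iters):
        if q is None:
            target = r                           # first pass: immediate reward only
        else:
            # Bootstrap target: r + gamma * Q(s', pi(s')), zeroed at terminal states.
            next_sa = np.concatenate(
                [s_next, np.array([policy(x) for x in s_next])], axis=1)
            target = r + gamma * (1.0 - done) * q.predict(next_sa)
        q = GradientBoostingRegressor().fit(sa, target)
    # Policy value = average Q at initial states under the policy's own actions.
    init_sa = np.concatenate(
        [initial_states, np.array([policy(x) for x in initial_states])], axis=1)
    return float(q.predict(init_sa).mean())
```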
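Next is a sketch of how policy returns can be modeled jointly with a Gaussian process. The kernel compares policies through the actions they take on a set of probe states, and FQE scores and real episodic returns both enter as noisy observations of the same latent return, with FQE assigned a larger noise. The exact kernel and noise model in A-OPS differ in detail; this is an illustration, not the released implementation.

```python
# Joint Gaussian-process model over policy returns (illustrative sketch).
import numpy as np

def policy_kernel(policies, probe_states, length_scale=1.0):
    """RBF kernel on mean action distance: similar actions -> similar returns."""
    acts = np.stack([[p(s) for s in probe_states] for p in policies])  # (P, S, A)
    d2 = ((acts[:, None] - acts[None, :]) ** 2).mean(axis=(2, 3))      # (P, P)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gp_posterior(K, obs_idx, obs_vals, obs_noise):
    """Posterior mean/variance of every policy's return given noisy observations.

    K:         (P, P) kernel matrix over policies
    obs_idx:   indices of observed policies (repeats allowed)
    obs_vals:  observed values (FQE scores or real episodic returns)
    obs_noise: per-observation noise variances (larger for FQE scores)
    """
    K_oo = K[np.ix_(obs_idx, obs_idx)] + np.diag(obs_noise)
    K_po = K[:, obs_idx]                                   # cross-covariances (P, O)
    alpha = np.linalg.solve(K_oo, np.asarray(obs_vals, dtype=float))
    mean = K_po @ alpha
    v = np.linalg.solve(K_oo, K_po.T)                      # (O, P)
    var = np.diag(K) - np.sum(K_po.T * v, axis=0)
    return mean, np.maximum(var, 1e-12)
```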
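Finally, a sketch of the resulting selection loop, reusing `gp_posterior` from the block above. A UCB-style rule (high posterior mean plus high posterior standard deviation) stands in here for the paper's acquisition strategy, and `run_episode` is an assumed helper that executes one episode on the real system and returns its return.

```python
# Active selection loop: start from FQE scores, then spend a small budget of
# real episodes on the most promising policies (illustrative sketch).
import numpy as np

def a_ops_loop(policies, K, fqe_scores, run_episode, budget=20, beta=2.0,
               fqe_noise=1.0, return_noise=0.1):
    obs_idx = list(range(len(policies)))        # every policy starts with an FQE score
    obs_vals = list(fqe_scores)
    obs_noise = [fqe_noise] * len(policies)     # FQE is noisier than real rollouts
    for _ in range(budget):
        mean, var = gp_posterior(K, obs_idx, obs_vals, obs_noise)
        pick = int(np.argmax(mean + beta * np.sqrt(var)))   # UCB-style acquisition
        obs_idx.append(pick)
        obs_vals.append(run_episode(policies[pick]))        # one real episode
        obs_noise.append(return_noise)
    mean, _ = gp_posterior(K, obs_idx, obs_vals, obs_noise)
    return int(np.argmax(mean))                 # recommend the highest-mean policy
```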

We demonstrate this procedure in multiple environments across several domains: dm-control, Atari, and simulated and real robotics. A-OPS reduces regret quickly, and with a moderate number of policy evaluations it identifies the best policy.

Our results suggest that effective offline policy selection is possible with only a small number of interactions with the environment, by leveraging the offline data, a special policy kernel, and Bayesian optimization. The code for A-OPS is open-source and available on GitHub, with an example dataset to try out.
