CGPO: Critic-Guided Diffusion Policy Optimization

TL;DR: CGPO integrates training-free critic guidance into the denoising process of diffusion policies, producing high-value regression targets that balance exploration and exploitation for online RL and real-world robot control.

Abstract

Recent reinforcement learning methods have achieved strong performance by leveraging the multimodality and exploration capability of diffusion policies. Existing weighted policy optimization methods preserve diversity but can exploit critic information inefficiently, while gradient-based methods can over-concentrate the policy and reduce exploration. CGPO (Critic-Guided Diffusion Policy Optimization) addresses this tradeoff by integrating a training-free guidance mechanism directly into diffusion-policy denoising.

CGPO steers action generation toward high-value regions defined by a learned critic and uses the guided actions as regression targets for the policy. This design reduces the time required to obtain high-quality actions and improves final performance while preserving diversity. Across five MuJoCo locomotion tasks, CGPO achieves state-of-the-art performance among diffusion-based RL methods. It also demonstrates successful real-world reinforcement learning on Franka robotic grasping and insertion tasks.

Method

Critic-guided denoising

During actor updates, CGPO synthesizes a guided target action by applying critic guidance in the final denoising steps. Guidance is used only during training, so rollout and evaluation retain the standard unguided diffusion sampler.
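Guiding only the final denoising steps can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the one-dimensional action, the quadratic critic `toy_q`, the stand-in denoiser `toy_denoise_step`, the gradient-ascent form of the guidance, and the `scale` and `guided_steps` parameters are hypothetical and are not taken from the CGPO implementation.

```python
import random

def toy_q(action, best=0.5):
    """Hypothetical critic: value peaks at `best`."""
    return -(action - best) ** 2

def toy_q_grad(action, best=0.5):
    """Analytic dQ/da for toy_q."""
    return -2.0 * (action - best)

def toy_denoise_step(action, t, rng):
    """Hypothetical stand-in for one reverse step of a trained diffusion policy."""
    noise = rng.gauss(0.0, 1.0) if t > 0 else 0.0  # no noise on the last step
    return 0.9 * action + 0.05 * noise

def sample_action(rng, num_steps=10, guided_steps=0, scale=0.1):
    """Run the reverse chain, nudging the action along dQ/da
    only in the last `guided_steps` steps (assumed guidance rule)."""
    action = rng.gauss(0.0, 1.0)  # start from Gaussian noise
    for t in reversed(range(num_steps)):
        action = toy_denoise_step(action, t, rng)
        if t < guided_steps:
            action += scale * toy_q_grad(action)
        action = max(-1.0, min(1.0, action))  # keep the action in [-1, 1]
    return action

def mean_q(guided_steps, n=2000, seed=0):
    """Average critic value of n sampled actions for a given guidance depth."""
    rng = random.Random(seed)
    total = sum(toy_q(sample_action(rng, guided_steps=guided_steps)) for _ in range(n))
    return total / n
```

Calling `sample_action` with `guided_steps=0` recovers the unguided sampler, which matches the description above of what rollout and evaluation use. In this toy setting, comparing `mean_q(3)` with `mean_q(0)` shows how guided samples concentrate on higher-value actions.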

Stable value learning

A value-calibrated network stabilizes critic-derived weights, while DDQN-style target construction and truncated-quantile aggregation reduce overestimation in the Q signal used for both weighting and guidance.
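One way to read the target construction above is sketched below. It is a sketch under stated assumptions rather than the authors' implementation: the online critic selects among candidate next actions, the target critics evaluate the selected one, and the pooled quantile estimates are truncated from the top before averaging. The function names, the `drop_per_critic` parameter, and the use of a finite candidate set are hypothetical.

```python
def truncated_quantile_mean(quantile_sets, drop_per_critic=1):
    """Pool quantile estimates from all critics, drop the largest ones,
    and average the rest to damp overestimation."""
    pooled = sorted(q for qs in quantile_sets for q in qs)
    keep = len(pooled) - drop_per_critic * len(quantile_sets)
    if keep <= 0:
        raise ValueError("truncation would remove every quantile")
    return sum(pooled[:keep]) / keep

def ddqn_style_target(reward, done, gamma, online_scores,
                      target_quantiles, drop_per_critic=1):
    """Build a scalar TD target.

    online_scores[i]    : online critic's value for candidate next action i
    target_quantiles[i] : per-critic quantile lists from the target critics
                          for candidate next action i
    """
    # Select the next action with the online critic ...
    best = max(range(len(online_scores)), key=online_scores.__getitem__)
    # ... but evaluate it with the (truncated) target critics.
    next_value = truncated_quantile_mean(target_quantiles[best], drop_per_critic)
    return reward + gamma * (1.0 - done) * next_value
```

Decoupling selection from evaluation and discarding the top quantiles are two separate guards against overestimation, so either can be ablated on its own in a sketch like this.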

Figure: CGPO method diagram. CGPO performs critic-guided target generation during training, then fits the diffusion policy with weighted denoising regression.
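The weighted denoising regression step can be sketched as follows. The source does not specify how the critic-derived weights are computed, so the softmax over values with a `temperature` parameter is an assumption, as are all function names. The forward-noising formula is the standard DDPM form with cumulative coefficient `alpha_bar`.

```python
import math

def softmax_weights(values, temperature=1.0):
    """Assumed weighting rule: softmax over critic values for a batch."""
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def noise_target(action, eps, alpha_bar):
    """Forward-diffuse a clean (guided) target action:
    a_t = sqrt(alpha_bar) * a_0 + sqrt(1 - alpha_bar) * eps."""
    return [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
            for a, e in zip(action, eps)]

def weighted_denoising_loss(eps_pred, eps_true, weights):
    """Sum over samples of weight times the mean squared noise-prediction error."""
    loss = 0.0
    for pred, true, w in zip(eps_pred, eps_true, weights):
        loss += w * sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    return loss
```

In a training loop, `noise_target` would be applied to the guided target actions, a network would predict the noise from the noised action, and `weighted_denoising_loss` would be minimized. That loop is omitted here because it depends on details the page does not state.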

Simulation Results

CGPO is evaluated on five MuJoCo v3 locomotion benchmarks for one million environment steps and compared against model-free and diffusion-based reinforcement learning baselines.

Figures: learning curves on HalfCheetah-v3, Walker2d-v3, and Humanoid-v3.

Real-world Robot Learning

Figure: Franka Emika Panda setup with multi-view RGB observations.

CGPO is deployed on a Franka Emika Panda arm with a Robotiq 2F-85 gripper under the HIL-SERL framework. The policy uses multi-view RGB images and proprioceptive states, and outputs Cartesian end-effector commands.

80% success rate on cylindrical peg-in-hole evaluation, outperforming the SAC baseline by 15 percentage points.

Figure: Sequential frames from cube stacking and cylindrical peg-in-hole real-world evaluations.

BibTeX

@inproceedings{ding2026cgpo,
  title     = {Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance},
  author    = {Ding, Shutong and Zhong, Zejia and Wang, Zhongyi and Hu, Ke and Pan, Bikang and Wang, Jingya and Shi, Ye},
  booktitle = {International Conference on Machine Learning},
  year      = {2026}
}