Diffusion-based Reinforcement Learning via
Q-weighted Variational Policy Optimization

Shutong Ding1,3, Ke Hu1, Zhenhao Zhang1, Kan Ren1,3, Weinan Zhang2, Jingyi Yu1,3, Jingya Wang1,3, Ye Shi1,3

1 ShanghaiTech University  2 Shanghai Jiao Tong University  3 MoE Key Laboratory of Intelligent Perception and Human Machine Collaboration

TL;DR: We propose a novel diffusion-based online RL algorithm that performs policy optimization with a Q-weighted variational loss and diffusion entropy regularization, exploiting the expressiveness and exploration capability of diffusion policies.

Abstract

Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. It has been verified that diffusion policies can significantly improve the performance of RL algorithms in continuous control tasks by overcoming the limitations of unimodal policies, such as Gaussian policies, and by providing the agent with enhanced exploration capabilities. However, existing works mainly focus on the application of diffusion policies in offline RL, while their incorporation into online RL has been less investigated. The training objective of the diffusion model, known as the variational lower bound, cannot be optimized directly in online RL due to the unavailability of 'good' samples (actions), which makes diffusion policy improvement difficult. To overcome this, we propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO). Specifically, we introduce the Q-weighted variational loss, which can be proven to be a tight lower bound of the policy objective in online RL under certain conditions. To fulfill these conditions in general scenarios, we introduce Q-weight transformation functions. Additionally, to further enhance the exploration capability of the diffusion policy, we design a special entropy regularization term. This entropy term is nontrivial because the log-likelihood of a diffusion policy is inaccessible. We also develop an efficient behavior policy that improves sample efficiency by reducing the variance of the diffusion policy during online interactions. Consequently, QVPO leverages the exploration capabilities and multimodality of diffusion policies, preventing the RL agent from converging to a sub-optimal policy. To verify the effectiveness of QVPO, we conduct comprehensive experiments on the MuJoCo continuous control benchmarks. The results demonstrate that QVPO achieves state-of-the-art performance in terms of both cumulative reward and sample efficiency.
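To make the idea of a Q-weighted variational loss concrete, below is a minimal PyTorch sketch. It is illustrative only: the MLP noise predictor, the linear noise schedule, the batch-softmax Q-weight transformation, and the `q_net(states, actions)` interface are assumptions made for this example, not the exact choices in the paper.

```python
# Illustrative sketch of a Q-weighted denoising (variational) loss for a diffusion policy.
# Architecture, noise schedule, and the softmax weight transform are assumptions for this demo.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoisePredictor(nn.Module):
    """Predicts the noise added to an action, conditioned on state and diffusion timestep."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256, T: int = 20):
        super().__init__()
        self.T = T
        self.t_embed = nn.Embedding(T, hidden)
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + hidden, hidden), nn.Mish(),
            nn.Linear(hidden, hidden), nn.Mish(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noisy_action, t):
        return self.net(torch.cat([state, noisy_action, self.t_embed(t)], dim=-1))


def q_weighted_variational_loss(model, q_net, states, actions, T=20):
    """Standard denoising loss, but each sampled action's term is weighted by a
    nonnegative transform of its Q-value (here: a batch softmax, an assumed form)."""
    B, device = states.shape[0], states.device
    betas = torch.linspace(1e-4, 2e-2, T, device=device)   # simple linear schedule (assumption)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (B,), device=device)
    noise = torch.randn_like(actions)
    ab = alpha_bar[t].unsqueeze(-1)
    noisy_actions = ab.sqrt() * actions + (1.0 - ab).sqrt() * noise

    with torch.no_grad():
        q = q_net(states, actions).squeeze(-1)              # critic interface is assumed
        w = torch.softmax(q, dim=0) * B                     # Q-weight transformation (assumed form)

    per_sample = F.mse_loss(model(states, noisy_actions, t), noise, reduction="none").mean(-1)
    return (w * per_sample).mean()
```

Actions with higher (transformed) Q-values contribute more to the denoising objective, so the diffusion policy is pulled toward high-value actions without ever needing their log-likelihood.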

Visualization

Demonstration videos of the learned QVPO policies on Hopper-v3, Walker2d-v3, HalfCheetah-v3, Ant-v3, and Humanoid-v3.

Results

Citation

If you find this work useful in your research, please consider citing:

@article{ding2024diffusion,
  title={Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization},
  author={Ding, Shutong and Hu, Ke and Zhang, Zhenhao and Ren, Kan and Zhang, Weinan and Yu, Jingyi and Wang, Jingya and Shi, Ye},
  journal={arXiv preprint arXiv:2405.16173},
  year={2024}
}