Quantile
Reinforcement learning
Quantile regression
CVaR
Computer science
Quantile function
Monotonic function
Function (biology)
Bellman equation
Mathematical optimization
Offline learning
Artificial intelligence
Expected shortfall
Machine learning
Econometrics
Online learning
Mathematics
Cumulative distribution function
Risk management
Statistics
Economics
Probability density function
Mathematical analysis
Biology
Management
Evolutionary biology
World Wide Web
Authors
Chenjia Bai,Ting Xiao,Zhoufan Zhu,Lingxiao Wang,Fan Zhou,Animesh Garg,Bin He,Peng Liu,Zhaoran Wang
Identifier
DOI:10.1109/tnnls.2022.3217189
Abstract
A key challenge in offline reinforcement learning (RL) is how to ensure the learned offline policy is safe, especially in safety-critical domains. In this article, we focus on learning a distributional value function in offline RL and optimizing a worst-case criterion of returns. However, optimizing a distributional value function in offline RL can be hard, since the crossing quantile issue is serious, and the distribution shift problem needs to be addressed. To this end, we propose monotonic quantile network (MQN) with conservative quantile regression (CQR) for risk-averse policy learning. First, we propose an MQN to learn the distribution over returns with non-crossing guarantees of the quantiles. Then, we perform CQR by penalizing the quantile estimation for out-of-distribution (OOD) actions to address the distribution shift in offline RL. Finally, we learn a worst-case policy by optimizing the conditional value-at-risk (CVaR) of the distributional value function. Furthermore, we provide theoretical analysis of the fixed-point convergence in our method. We conduct experiments in both risk-neutral and risk-sensitive offline settings, and the results show that our method obtains safe and conservative behaviors in robotic locomotion tasks.
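The abstract only names the three ingredients of the method. Purely as an illustration, the sketch below shows one way the pieces could fit together: a quantile network that cannot produce crossing quantiles (base value plus non-negative cumulative gaps), a pinball quantile-regression loss, a conservative penalty that pushes down value estimates for out-of-distribution actions, and a CVaR objective over the lowest quantiles. All names, network sizes, and the exact form of the OOD penalty are assumptions for this sketch, not the authors' implementation.

```python
# Hypothetical sketch of the ingredients named in the abstract; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_QUANTILES = 32
TAUS = (torch.arange(N_QUANTILES, dtype=torch.float32) + 0.5) / N_QUANTILES  # quantile midpoints

class MonotonicQuantileNet(nn.Module):
    """Predicts N non-crossing return quantiles for a (state, action) pair.
    Monotonicity is enforced by construction: a base value plus a cumulative
    sum of non-negative gaps, so q_1 <= q_2 <= ... <= q_N always holds."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.base = nn.Linear(hidden, 1)                  # lowest quantile
        self.gaps = nn.Linear(hidden, N_QUANTILES - 1)    # non-negative increments

    def forward(self, state, action):
        h = self.body(torch.cat([state, action], dim=-1))
        base = self.base(h)
        gaps = F.softplus(self.gaps(h))                   # >= 0, so quantiles cannot cross
        return torch.cat([base, base + torch.cumsum(gaps, dim=-1)], dim=-1)  # [B, N]

def quantile_regression_loss(pred, target):
    """Pinball loss between predicted quantiles [B, N] and target return samples [B, M]."""
    u = target.unsqueeze(1) - pred.unsqueeze(2)           # [B, N, M]
    taus = TAUS.view(1, -1, 1)
    return torch.where(u >= 0, taus * u, (taus - 1.0) * u).mean()

def conservative_penalty(net, state, dataset_action, num_ood=10):
    """Illustrative conservative term: lower the quantile estimates of randomly
    sampled (out-of-distribution) actions relative to in-dataset actions."""
    B, act_dim = dataset_action.shape
    ood_actions = torch.rand(B, num_ood, act_dim) * 2 - 1           # assumes actions in [-1, 1]
    s_rep = state.unsqueeze(1).expand(B, num_ood, -1).reshape(-1, state.shape[-1])
    q_ood = net(s_rep, ood_actions.reshape(-1, act_dim)).mean()
    q_data = net(state, dataset_action).mean()
    return q_ood - q_data                                            # minimized during training

def cvar_objective(quantiles, alpha=0.1):
    """CVaR_alpha of the return distribution: mean of the lowest alpha-fraction of
    quantiles (valid because the quantiles are sorted by construction)."""
    k = max(1, int(alpha * quantiles.shape[-1]))
    return quantiles[..., :k].mean(dim=-1)
```

Under these assumptions, the critic would be trained on quantile_regression_loss plus a weighted conservative_penalty, and a risk-averse actor would be updated to maximize cvar_objective(net(state, policy(state))) rather than the mean return.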