Keywords
Temporal difference learning
Artificial neural network
Sublinear function
Reinforcement learning
Convergence (economics)
Divergence (linguistics)
Nonlinear system
Bellman equation
Computer science
Function (biology)
Global optimization
Mathematical optimization
Mathematics
Applied mathematics
Artificial intelligence
Economics
Physics
Mathematical analysis
Philosophy
Linguistics
Biology
Evolutionary biology
Quantum mechanics
Economic growth
Authors
Qi Cai, Zhuoran Yang, Jason D. Lee, Zhaoran Wang
Source
Venue: Cornell University - arXiv
Date: 2019-05-24
Cited by: 17
Identifier
DOI: 10.48550/arxiv.1905.10027
Abstract
Temporal-difference learning (TD), coupled with neural networks, is among the most fundamental building blocks of deep reinforcement learning. However, due to the nonlinearity in value function approximation, such a coupling leads to nonconvexity and even divergence in optimization. As a result, the global convergence of neural TD remains unclear. In this paper, we prove for the first time that neural TD converges at a sublinear rate to the global optimum of the mean-squared projected Bellman error for policy evaluation. In particular, we show how such global convergence is enabled by the overparametrization of neural networks, which also plays a vital role in the empirical success of neural TD. Beyond policy evaluation, we establish the global convergence of neural (soft) Q-learning, which is further connected to that of policy gradient algorithms.
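For reference, the mean-squared projected Bellman error (MSPBE) that the convergence result targets is typically defined as follows (the notation here is ours, not quoted from the paper):

    MSPBE(theta) = E_{s ~ mu} [ ( V_theta(s) - Pi_F T^pi V_theta(s) )^2 ],

where (T^pi V)(s) = E[ r(s) + gamma * V(s') ] is the Bellman operator under policy pi, Pi_F is the projection onto the value-function class F, and mu is the stationary state distribution induced by pi.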
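As a minimal sketch of the setting the abstract describes, the Python snippet below runs semi-gradient TD(0) with a two-layer ReLU network on a synthetic MDP. The toy MDP, network width, step size, and the choice to train only the input-layer weights (with fixed +/-1 output weights, a common setup in overparametrized-TD analyses) are illustrative assumptions, not details taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP under a fixed policy: random features, transitions, rewards.
n_states, d, width, gamma = 20, 8, 256, 0.9
features = rng.normal(size=(n_states, d)) / np.sqrt(d)   # state features x(s)
P = rng.dirichlet(np.ones(n_states), size=n_states)      # P(s' | s), rows sum to 1
r = rng.normal(size=n_states)                            # reward per state

# Two-layer ReLU network: V(s) = (1/sqrt(m)) * sum_k b_k * relu(W_k . x(s)).
# Only W is trained; b is fixed at random signs.
W = rng.normal(size=(width, d))
b = rng.choice([-1.0, 1.0], size=width)

def value(x, W):
    return float(b @ np.maximum(W @ x, 0.0)) / np.sqrt(width)

def grad_value(x, W):
    # dV/dW: outer product of (b * relu'(W x)) with x, scaled by 1/sqrt(m).
    act = (W @ x > 0).astype(float)
    return np.outer(b * act, x) / np.sqrt(width)

lr, s = 0.01, 0
for t in range(5000):
    s_next = rng.choice(n_states, p=P[s])
    x, x_next = features[s], features[s_next]
    # Semi-gradient TD(0): delta = r(s) + gamma * V(s') - V(s);
    # no gradient is taken through the bootstrapped target V(s').
    delta = r[s] + gamma * value(x_next, W) - value(x, W)
    W += lr * delta * grad_value(x, W)
    s = s_next

print("sample fitted values:", [round(value(features[i], W), 3) for i in range(3)])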