Model-Free Nonstationary Reinforcement Learning: Near-Optimal Regret and Applications in Multiagent Reinforcement Learning and Inventory Control

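Reinforcement learning, Regret, Rebar, Computer science, Control (management), Artificial intelligence, Machine learning, Psychology, Social psychology
Authors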
Weichao Mao,Kaiqing Zhang,Ruihao Zhu,David Simchi‐Levi,Tamer Başar
Source
Journal: Management Science [Institute for Operations Research and the Management Sciences]
Volume (issue): 71 (2): 1564-1580 Cited by: 5
Identifier
DOI:10.1287/mnsc.2022.02533
Abstract

We consider model-free reinforcement learning (RL) in nonstationary Markov decision processes. Both the reward functions and the state transition functions are allowed to vary arbitrarily over time as long as their cumulative variations do not exceed certain variation budgets. We propose Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free algorithm for nonstationary RL, and show that it outperforms existing solutions in terms of dynamic regret. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret bound of [Formula: see text], where S and A are the numbers of states and actions, respectively, [Formula: see text] is the variation budget, H is the number of time steps per episode, and T is the total number of time steps. We further present a parameter-free algorithm named Double-Restart Q-UCB that does not require prior knowledge of the variation budget. We show that our algorithms are nearly optimal by establishing an information-theoretical lower bound of [Formula: see text], the first lower bound in nonstationary RL. Numerical experiments validate the advantages of RestartQ-UCB in terms of both cumulative rewards and computational efficiency. We demonstrate the power of our results in examples of multiagent RL and inventory control across related products.

This paper was accepted by Omar Besbes, revenue management and market analytics.

Funding: The research of D. Simchi-Levi and R. Zhu was supported by the MIT Data Science Laboratory. The research of W. Mao, K. Zhang, and T. Başar was supported in part by the U.S. Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, in part by the Office of Naval Research (ONR) [MURI Grant N00014-16-1-2710], and in part by the Air Force Office of Scientific Research (AFOSR) [Grant FA9550-19-1-0353]. K. Zhang also acknowledges support from U.S. Army Research Laboratory (ARL) [Grant W911NF-24-1-0085].
Supplemental Material: The online appendix and data files are available at https://doi.org/10.1287/mnsc.2022.02533 .
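
The abstract names RestartQ-UCB but does not spell out its update rule, so the following is only a rough illustration of the general idea it describes: optimistic tabular Q-learning whose estimates are periodically discarded so that stale data from a drifted environment cannot mislead the learner. Everything concrete here is an assumption and not taken from the paper: the function name `restart_q_ucb_sketch`, the `env_step` callback, the fixed `epoch_len` restart schedule, the toy environment, and the Hoeffding-style bonus with learning rate (H + 1) / (H + n), which is borrowed from standard UCB Q-learning rather than the Freedman-type bonus the abstract mentions.

```python
import math
import random


def restart_q_ucb_sketch(env_step, S, A, H, num_episodes, epoch_len, c=1.0, seed=0):
    """Optimistic tabular Q-learning that resets every `epoch_len` episodes.

    env_step(episode, h, s, a, rng) -> (reward in [0, 1], next_state) is a
    hypothetical environment callback, not an interface from the paper.
    """
    rng = random.Random(seed)
    total_reward = 0.0
    for k in range(num_episodes):
        if k % epoch_len == 0:
            # Restart: forget old data, re-initialize optimistically.
            Q = [[[float(H - h)] * A for _ in range(S)] for h in range(H)]
            N = [[[0] * A for _ in range(S)] for _ in range(H)]
        s = 0  # fixed initial state (assumption)
        for h in range(H):
            # Act greedily with respect to the optimistic estimates.
            a = max(range(A), key=lambda x: Q[h][s][x])
            r, s_next = env_step(k, h, s, a, rng)
            total_reward += r
            N[h][s][a] += 1
            n = N[h][s][a]
            alpha = (H + 1) / (H + n)  # equals 1 on the first visit
            # Hoeffding-style exploration bonus (assumed form).
            bonus = c * math.sqrt(H ** 3 * math.log(S * A * H * epoch_len + 1) / n)
            v_next = max(Q[h + 1][s_next]) if h + 1 < H else 0.0
            target = r + v_next + bonus
            # Clip so the estimate never exceeds the largest possible return.
            Q[h][s][a] = min(float(H - h), (1 - alpha) * Q[h][s][a] + alpha * target)
            s = s_next
    return total_reward


def toy_env(episode, h, s, a, rng):
    # Hypothetical drifting environment: action 0 pays during the first
    # 100 episodes, action 1 afterwards; next state is uniform over 2 states.
    good = 0 if episode < 100 else 1
    return (1.0 if a == good else 0.0), rng.randrange(2)


total = restart_q_ucb_sketch(toy_env, S=2, A=2, H=3, num_episodes=200, epoch_len=50)
print(f"cumulative reward over 200 episodes: {total}")
```

In this sketch a shorter `epoch_len` adapts faster to change but discards more data. Choosing the restart frequency normally requires knowing how much the environment varies, which is the dependence the abstract says Double-Restart Q-UCB removes.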