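Computer science
Reinforcement learning
Benchmark (surveying)
Machine learning
Reuse
Baseline (sea)
Artificial intelligence
Thompson sampling
Sampling (signal processing)
Sample (material)
Regret
Filter (signal processing)
Geography
Chemistry
Geology
Oceanography
Biology
Chromatography
Computer vision
Ecology
Geodesy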
Authors
Ximing Liu,Tianqing Zhu,Cuiqing Jiang,Dayong Ye,Fuqing Zhao
Identifier
DOI:10.1016/j.eswa.2021.116023
Abstract
Experience replay has been widely used in deep reinforcement learning. It allows online reinforcement learning agents to remember and reuse past experiences. To further improve the sampling efficiency of experience replay, the most useful experiences should be sampled more frequently. Existing methods usually design their sampling strategy around a few criteria, but they tend to combine those criteria in a linear or fixed manner, so the strategy is static and independent of the learning agent. This ignores the dynamic nature of the environment and can therefore lead only to suboptimal performance. In this work, we propose a dynamic experience replay strategy driven by the interaction between the agent and the environment, called Prioritized Experience Replay based on Multi-armed Bandit (PERMAB). PERMAB adaptively combines multiple priority criteria to measure the importance of an experience. In particular, the weight of each criterion is adjusted from episode to episode according to its contribution to the agent's performance, so that criteria that are useful in the agent's current state receive more weight. The proposed replay strategy takes both sample informativeness and diversity into consideration, which can significantly boost the learning ability and speed of the game agent. Experimental results show that PERMAB accelerates network learning and achieves better performance than baseline algorithms on seven benchmark environments of varying difficulty.
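The general idea of weighting several priority criteria with a bandit can be sketched as follows. This is a minimal illustration under stated assumptions, not the PERMAB algorithm itself: the abstract does not specify the bandit type, the criteria, or the reward signal. The sketch assumes a Beta-Bernoulli Thompson sampling bandit with one arm per criterion, min-max normalised criterion scores, and a binary "did the episode return improve" reward. All names (CriterionBandit, combined_priorities, sample_batch) are hypothetical.

```python
import random


class CriterionBandit:
    """Illustrative Beta-Bernoulli Thompson sampler with one arm per criterion.

    Assumption: the abstract does not say which bandit PERMAB uses.
    """

    def __init__(self, n_criteria, seed=0):
        self.alpha = [1.0] * n_criteria  # pseudo-counts of "helpful" episodes
        self.beta = [1.0] * n_criteria   # pseudo-counts of "unhelpful" episodes
        self.rng = random.Random(seed)

    def sample_weights(self):
        # Draw one value per criterion from its posterior, then normalise to sum to 1.
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        total = sum(draws)
        return [d / total for d in draws]

    def update(self, weights, improved):
        # Credit each criterion in proportion to the weight it carried this episode.
        for k, w in enumerate(weights):
            if improved:
                self.alpha[k] += w
            else:
                self.beta[k] += w


def normalise(scores):
    # Min-max scale to [0, 1] so criteria with different units are comparable.
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def combined_priorities(criteria_scores, weights, eps=1e-3):
    # criteria_scores[k][i] is the score of stored transition i under criterion k.
    # eps keeps every transition sampleable even if all its scores are zero.
    normed = [normalise(s) for s in criteria_scores]
    n = len(normed[0])
    return [eps + sum(w * normed[k][i] for k, w in enumerate(weights))
            for i in range(n)]


def sample_batch(buffer, priorities, batch_size, rng):
    # Sample with replacement, with probability proportional to priority.
    return rng.choices(buffer, weights=priorities, k=batch_size)
```

In a training loop, one would call sample_weights() at the start of each episode, use combined_priorities() to rank the stored transitions, and call update() at the end of the episode with a flag indicating whether the agent's return improved. The two example criteria a reader might plug in (for instance absolute TD error and recency) are likewise assumptions, not taken from the paper.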