Stable Exploration via Imitating Highly Scored Episode-Decayed Exploration Episodes in Procedurally Generated Environments

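Overfitting; Ranking (information retrieval); Computer science; Imitation; Artificial intelligence; Set (abstract data type); Machine learning; Psychology; Artificial neural network; Neuroscience; Programming language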

Authors

Mao Xu, Shuzhi Sam Ge, Dongjie Zhao, Qian Zhao

Source

Journal: IEEE Transactions on Cognitive and Developmental Systems [Institute of Electrical and Electronics Engineers]
Date: 2023-12-11 Volume (issue): 16 (3), pages 1121-1133 Citations: 1

Identifiers

DOI: 10.1109/tcds.2023.3339215

Abstract

Exploring procedurally-generated environments is a formidable challenge for model-free deep reinforcement learning (DRL). One of the state-of-the-art exploration methods, exploration via ranking the episodes (RAPID), assigns episode-level episodic exploration scores to past episodes and makes the DRL agent imitate exploration behaviors from the highly-scored episodes. However, in complex procedurally-generated environments, such continued imitation can hinder RAPID's performance due to the emergence of solidified episodes, i.e., episodes that remain in the highly-scored episode set due to their high scores. These solidified episodes can lead the RAPID DRL agent to overfit, hindering its exploration and performance. To address this, we design an episode-decayed exploration score, which combines the episodic exploration score and an episodic decay factor, to avoid solidifying highly-scored episodes and aid in selecting good exploration episodes. Leveraging this score, we propose exploration via imitating highly-scored episode-decayed exploration episodes (EDEE), an effective and stable exploration method for procedurally-generated environments. EDEE assigns episode-decayed exploration scores to past episodes and stores the highly-scored episodes as good exploration episodes in a small ranking buffer. The DRL agent then imitates good exploration behaviors sampled from this ranking buffer through the exploration-based sampling to reproduce these good exploration behaviors from good exploration episodes. Extensive experiments on procedurally-generated environments, specifically MiniGrid and 3D maze from MiniWorld, and sparse MuJoCo environments show that EDEE significantly outperforms RAPID in terms of final performance and sample efficiency in complex procedurally-generated environments and sparse continuous environments. Moreover, even without extrinsic rewards, EDEE maintains excellent performance in procedurally-generated environments.
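
The ranking-buffer idea described in the abstract can be illustrated with a small sketch. Everything concrete here is an assumption, not the authors' formulation: the abstract does not give the scoring formula, so the sketch assumes the decayed score is the episodic exploration score multiplied by `decay ** age`, uses the fraction of distinct states visited as a stand-in exploration score, and replaces the paper's exploration-based sampling with plain uniform sampling. The names `Episode`, `DecayedRankingBuffer`, and `distinct_state_ratio` are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


def distinct_state_ratio(states: List[int]) -> float:
    """Stand-in exploration score (assumption): distinct states / total states."""
    return len(set(states)) / len(states) if states else 0.0


@dataclass
class Episode:
    transitions: List[Tuple[int, int]]  # (state, action) pairs to imitate
    exploration_score: float            # episodic exploration score
    born_at: int                        # index of the episode when it was collected


class DecayedRankingBuffer:
    """Keeps the top-`capacity` episodes ranked by an assumed decayed score."""

    def __init__(self, capacity: int, decay: float = 0.99) -> None:
        self.capacity = capacity
        self.decay = decay
        self.episodes: List[Episode] = []

    def decayed_score(self, ep: Episode, now: int) -> float:
        # Assumed form: the score shrinks geometrically with the episode's age,
        # so an old episode cannot hold its place on its raw score alone.
        return ep.exploration_score * self.decay ** (now - ep.born_at)

    def add(self, ep: Episode, now: int) -> None:
        # Re-rank every stored episode at the current time, then truncate.
        self.episodes.append(ep)
        self.episodes.sort(key=lambda e: self.decayed_score(e, now), reverse=True)
        del self.episodes[self.capacity:]

    def sample(self, batch_size: int, rng: random.Random) -> List[Tuple[int, int]]:
        # Uniform sampling of (state, action) pairs for behavior cloning;
        # the paper's exploration-based sampling is not reproduced here.
        pairs = [t for ep in self.episodes for t in ep.transitions]
        return rng.sample(pairs, min(batch_size, len(pairs)))


if __name__ == "__main__":
    buf = DecayedRankingBuffer(capacity=1, decay=0.5)
    old = Episode([(0, 0), (1, 1)], exploration_score=1.0, born_at=0)
    new = Episode([(2, 0), (2, 1)], exploration_score=0.4, born_at=2)
    buf.add(old, now=0)
    buf.add(new, now=2)  # old decays to 1.0 * 0.5**2 = 0.25, below 0.4
    print(buf.episodes[0] is new)
```

With `decay=1.0` the same two episodes would leave `old` in the buffer indefinitely, which is the "solidified episode" behavior the abstract attributes to RAPID; any decay below 1 lets a newer, lower-scored episode eventually displace it.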
