Set (abstract data type)
Computer science
Reinforcement learning
Task (project management)
Class (philosophy)
Function (biology)
Grid
Construct (python library)
Order (exchange)
Focus (optics)
Monotonic function
Artificial intelligence
Mathematics
Mathematical analysis
Programming language
Management
Economics
Geometry
Physics
Optics
Biology
Evolutionary biology
Finance
Authors
Tom Zahavy, Andre Noll Barreto, Daniel J. Mankowitz, Shaobo Hou, Brendan O'Donoghue, Iurii Kemaev, Satinder Singh
Source
Venue: Cornell University - arXiv
Date: 2021-05-03
Citations: 3
Abstract
We study the problem of how to construct a set of policies that can be composed together to solve a collection of reinforcement learning tasks. Each task is a different reward function defined as a linear combination of known features. We consider a specific class of policy compositions which we call set improving policies (SIPs): given a set of policies and a set of tasks, a SIP is any composition of the former whose performance is at least as good as that of its constituents across all the tasks. We focus on the most conservative instantiation of SIPs, set-max policies (SMPs), so our analysis extends to any SIP. This includes known policy-composition operators like generalized policy improvement. Our main contribution is an algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks. The algorithm works by successively adding new policies to the set. We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and the algorithm only stops when there does not exist a policy that leads to improved performance. We empirically evaluate our algorithm on a grid world and also on a set of domains from the DeepMind control suite. We confirm our theoretical results regarding the monotonically improving performance of our algorithm. Interestingly, we also show empirically that the sets of policies computed by the algorithm are diverse, leading to different trajectories in the grid world and very distinct locomotion skills in the control suite.
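The construction described in the abstract can be illustrated with a small sketch. Below is a minimal, hedged Python example (not the authors' code) of the two ingredients: a set-max policy (SMP), which on any task simply executes the best policy in the current set, and the iterative procedure that repeatedly finds the worst-case task for the current SMP and adds a policy tailored to it. It assumes the linear-reward view stated in the abstract, where each task is a weight vector w over known features and a policy's value on w is the dot product of its feature-expectation vector with w; the oracle `solve_task`, which returns the feature expectations of an optimal policy for a given task, is a hypothetical placeholder.

```python
# Minimal sketch of set-max policies (SMPs) and worst-case policy-set construction.
# Assumptions (not from the paper's code): policies are represented by their
# feature-expectation vectors, tasks by weight vectors, and `solve_task` is a
# hypothetical oracle returning the feature expectations of a task's optimal policy.
import numpy as np

def smp_value(psis, w):
    """Value of the SMP on task w: the best constituent policy's value,
    where each policy's value is psi . w (linear reward over known features)."""
    return max(float(psi @ w) for psi in psis)

def worst_case_task(psis, tasks):
    """Task on which the current SMP performs worst."""
    return min(tasks, key=lambda w: smp_value(psis, w))

def build_policy_set(tasks, solve_task, max_policies=10, tol=1e-8):
    """Iteratively add a policy optimized for the current worst-case task.
    Stops when no new policy strictly improves the worst-case SMP value."""
    psis = [solve_task(tasks[0])]              # seed with any task's optimal policy
    for _ in range(max_policies - 1):
        w_bad = worst_case_task(psis, tasks)   # weakest task for the current set
        psi_new = solve_task(w_bad)            # policy tailored to that task
        if float(psi_new @ w_bad) <= smp_value(psis, w_bad) + tol:
            break                              # no strict improvement is possible
        psis.append(psi_new)
    return psis
```

Under these assumptions, the loop mirrors the guarantee stated in the abstract: the worst-case value of the SMP does not decrease across iterations, and the procedure halts only when no single added policy would improve it.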