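Regret
Computer science
Set (abstract data type)
Upper and lower bounds
Mathematical optimization
Artificial intelligence
Machine learning
Mathematics
Mathematical analysis
Programming language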
Authors
Zhengyuan Zhou, Renyuan Xu, José Blanchet
Source
Journal: Neural Information Processing Systems
Date: 2019-09-06
Volume/Issue: 32: 5197-5208
Citations: 35
Abstract
In this paper, we consider online learning in generalized linear contextual bandits where rewards are not immediately observed. Instead, rewards are available to the decision maker only after some delay, which is unknown and stochastic, even though a decision must be made at each time step for an incoming set of contexts. We study the performance of upper confidence bound (UCB) based algorithms adapted to this delayed setting. In particular, we design a delay-adaptive algorithm, which we call Delayed UCB, for generalized linear contextual bandits using UCB-style exploration and establish regret bounds under various delay assumptions. In the important special case of linear contextual bandits, we further modify this algorithm and establish a tighter regret bound under the same delay assumptions. Our results contribute to the broad landscape of contextual bandits literature by establishing that UCB algorithms, which are widely deployed in modern recommendation engines, can be made robust to delays.