Regret
Mathematics
Upper and lower bounds
Constant (computer programming)
Exponential function
Multi-armed bandit
Kullback–Leibler divergence
Index (typesetting)
Bayesian probability
Mathematical optimization
Time horizon
Applied mathematics
Combinatorics
Statistics
Mathematical analysis
Computer science
World Wide Web
Programming language
Identifiers
DOI:10.1017/s0001867800047790
Abstract
We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on this problem, Lai and Robbins had obtained a O(log n) lower bound on the regret with a constant that depends on the Kullback–Leibler number. They also constructed policies for some specific families of probability distributions (including exponential families) that achieved the lower bound. In this paper we construct index policies that depend on the rewards from each arm only through their sample mean. These policies are computationally much simpler and are also applicable much more generally. They achieve a O(log n) regret with a constant that is also based on the Kullback–Leibler number. This constant turns out to be optimal for one-parameter exponential families; however, in general it is derived from the optimal one via a 'contraction' principle. Our results rely entirely on a few key lemmas from the theory of large deviations.
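To illustrate the kind of sample-mean-based index policy the abstract describes, the following is a minimal Python sketch on Bernoulli arms. The specific exploration bonus used here (a UCB1-style term, sqrt(2 log t / pulls)) and the function name `sample_mean_index_policy` are illustrative assumptions, not the exact index constructed in the paper; the point is only that the policy ranks arms using the observed sample means plus a term that grows slowly with time.

```python
import math
import random


def sample_mean_index_policy(arm_probs, horizon):
    """Simulate a sample-mean-based index policy on Bernoulli arms.

    `arm_probs` holds the true success probabilities (unknown to the
    policy); the policy only observes the rewards it receives.
    The index form below is a UCB1-style illustration, not the
    paper's exact construction.
    """
    n_arms = len(arm_probs)
    counts = [0] * n_arms     # number of times each arm was pulled
    means = [0.0] * n_arms    # sample mean reward of each arm
    total_reward = 0.0

    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1       # pull each arm once to initialize
        else:
            # Index = sample mean + exploration bonus that grows like log t.
            arm = max(
                range(n_arms),
                key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if random.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total_reward += reward

    # Regret relative to always pulling the best arm.
    return horizon * max(arm_probs) - total_reward


if __name__ == "__main__":
    random.seed(0)
    print(sample_mean_index_policy([0.3, 0.5, 0.7], horizon=10_000))
```

Under such a policy the suboptimal arms are pulled only O(log n) times in expectation, which is what yields the O(log n) regret growth discussed in the abstract.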