Reinforcement learning
Markov decision process
Observable
Computer science
Partially observable Markov decision process
State (computer science)
Markov process
State space
Mathematical optimization
Artificial intelligence
Class (philosophy)
Mathematics
Algorithm
Statistics
Physics
Quantum mechanics
Authors
Satinder Singh,Tommi Jaakkola,Michael I. Jordan
Source
Journal: Elsevier eBooks
[Elsevier]
Date: 1994-01-01
Pages: 284-292
Citations: 291
Identifier
DOI:10.1016/b978-1-55860-335-6.50042-8
Abstract
Reinforcement learning (RL) algorithms provide a sound theoretical basis for building learning control architectures for embedded agents. Unfortunately, all of the theory and much of the practice (see Barto et al., 1983, for an exception) of RL is limited to Markovian decision processes (MDPs). Many real-world decision tasks, however, are inherently non-Markovian, i.e., the state of the environment is only incompletely known to the learning agent. In this paper we consider only partially observable MDPs (POMDPs), a useful class of non-Markovian decision processes. Most previous approaches to such problems have combined computationally expensive state-estimation techniques with learning control. This paper investigates learning in POMDPs without resorting to any form of state estimation. We present results about what TD(0) and Q-learning will do when applied to POMDPs. It is shown that the conventional discounted RL framework is inadequate to deal with POMDPs. Finally, we develop a new framework for learning without state estimation in POMDPs by including stochastic policies in the search space, and by defining the value or utility of a distribution over states.
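As a rough illustration of the setting the abstract describes (learning directly from observations, with no state estimation), the sketch below runs tabular Q-learning keyed on observations rather than hidden states. The two-state environment, the aliased `observe` function, the reward scheme, and all hyperparameters are hypothetical assumptions chosen only to exhibit perceptual aliasing; this is not the paper's own experiment or analysis.

```python
import random
from collections import defaultdict

# Hypothetical two-hidden-state POMDP: both states emit the same observation,
# so a memoryless learner cannot tell them apart (perceptual aliasing).
STATES = [0, 1]
ACTIONS = [0, 1]

def observe(state):
    # Aliased observation: the agent sees the same signal in both hidden states.
    return 0

def step(state, action):
    # Made-up dynamics/rewards: the "right" action depends on the hidden state,
    # which the agent never observes directly.
    reward = 1.0 if action == state else -1.0
    next_state = random.choice(STATES)
    return next_state, reward

def q_learning_on_observations(steps=5000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning whose table is indexed by observation, not hidden state."""
    Q = defaultdict(float)
    state = random.choice(STATES)
    for _ in range(steps):
        obs = observe(state)
        # Epsilon-greedy action selection over the (aliased) observation.
        if random.random() < eps:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(obs, a)])
        next_state, reward = step(state, action)
        next_obs = observe(next_state)
        best_next = max(Q[(next_obs, a)] for a in ACTIONS)
        # Standard Q-learning update, applied to observations as if they were states.
        Q[(obs, action)] += alpha * (reward + gamma * best_next - Q[(obs, action)])
        state = next_state
    return Q

if __name__ == "__main__":
    print(dict(q_learning_on_observations()))
```

In this toy setup the Q-values attached to the single observation cannot separate the two hidden states, so no deterministic memoryless policy can exploit the reward structure; this is the kind of difficulty that motivates the abstract's proposal to include stochastic policies in the search space and to define the value of a distribution over states.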