Mohammed Amine Bennouna, Dessislava A. Pachamanova, Georgia Perakis, Omar Skali Lami
Source
Journal: Management Science [Institute for Operations Research and the Management Sciences] Date: 2024-09-26
Identifier
DOI: 10.1287/mnsc.2022.01652
Abstract
This paper proposes a framework for learning the most concise Markov decision process (MDP) model of a continuous state-space dynamic system from observed transition data. This setting is encountered in numerous important applications, such as patient treatment, online advertising, recommender systems, and estimation of treatment effects in econometrics. Most existing methods in offline reinforcement learning construct functional approximations of the value function or of the transition and reward functions, requiring complex and often uninterpretable function approximators. Our approach instead relies on partitioning the system's observation space into regions that constitute the states of a finite MDP representing the system. We discuss the theoretically minimal MDP representation that preserves the values and, therefore, the optimal policy of the dynamic system (in a sense, the optimal discretization). We formally define the problem of learning such a concise representation from transition data without exploration. Learning such a representation improves tractability and, importantly, provides interpretability. To solve this problem, we introduce an in-sample property of partitions of the observation space that we name coherence, and we show that if the class of possible partitions has finite Vapnik-Chervonenkis dimension, any partition that is coherent with the transition data converges to the minimal representation of the system, with provable finite-sample probably approximately correct (PAC) convergence guarantees. This insight motivates our minimal representation learning algorithm, which constructs from transition data an MDP representation that approximates the minimal representation of the system. We illustrate the effectiveness of the proposed framework through numerical experiments in both deterministic and stochastic environments, as well as with real data.

This paper was accepted by Chung Piaw Teo, optimization.

Funding: The authors are very grateful to the Health Systems Initiative at MIT Sloan for financial support for this project.

Supplemental Material: The online appendix is available at https://doi.org/10.1287/mnsc.2022.01652.
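To make the central idea concrete, the sketch below (not the paper's algorithm) shows how a fixed partition of a continuous observation space induces a finite MDP whose transition probabilities and mean rewards can be estimated from offline transition data. All names here (`estimate_mdp`, `partition_fn`, the `(obs, action, reward, next_obs)` tuple layout) are illustrative assumptions, not an API from the paper.

```python
# Illustrative sketch only: NOT the paper's minimal representation
# learning algorithm. Given a fixed partition of the observation space,
# it estimates the induced finite MDP from offline transition data.
from collections import defaultdict

def estimate_mdp(transitions, partition_fn, n_states, n_actions):
    """Estimate transition probabilities and mean rewards of the finite
    MDP induced by mapping each observation to a partition cell (state).

    transitions:  iterable of (obs, action, reward, next_obs) tuples
    partition_fn: maps a continuous observation to a state index in
                  {0, ..., n_states - 1}
    """
    counts = defaultdict(float)       # (s, a, s') -> transition count
    totals = defaultdict(float)       # (s, a)     -> visit count
    reward_sums = defaultdict(float)  # (s, a)     -> cumulative reward

    for obs, a, r, next_obs in transitions:
        s, s_next = partition_fn(obs), partition_fn(next_obs)
        counts[(s, a, s_next)] += 1.0
        totals[(s, a)] += 1.0
        reward_sums[(s, a)] += r

    # Empirical transition kernel P[s][a][s'] and mean reward R[s][a];
    # state-action pairs never visited in the data default to zero.
    P = [[[counts[(s, a, t)] / totals[(s, a)] if totals[(s, a)] else 0.0
           for t in range(n_states)]
          for a in range(n_actions)]
         for s in range(n_states)]
    R = [[reward_sums[(s, a)] / totals[(s, a)] if totals[(s, a)] else 0.0
          for a in range(n_actions)]
         for s in range(n_states)]
    return P, R

# Hypothetical usage: 1-D observations split at an assumed threshold of 0.5.
data = [(0.2, 0, 1.0, 0.7), (0.8, 1, 0.0, 0.3), (0.6, 0, 1.0, 0.9)]
P, R = estimate_mdp(data, lambda x: 0 if x < 0.5 else 1,
                    n_states=2, n_actions=2)
```

In the paper's framework, the quality of such an induced MDP hinges on the choice of partition: coherence, loosely speaking, is an in-sample requirement that observations grouped into the same cell exhibit consistent empirical transition and reward behavior, and the paper shows that coherent partitions from a finite-VC-dimension class converge to the minimal representation.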