Inventory control
Computer science
Sample (material)
Control (management)
Operations management
Operations research
Economic order quantity
Mathematical optimization
Business
Economics
Artificial intelligence
Mathematics
Supply chain
Marketing
Chromatography
Chemistry
Authors
Fan Xiaoyu, Boxiao Chen, Tava Lennon Olsen, Hanzhang Qin, Zhengyuan Zhou
Identifier
DOI: 10.1177/10591478251378851
Abstract
We study the sample complexity of offline learning for a class of structured MDPs describing inventory control systems with a fixed ordering cost (setup cost), a fundamental problem in supply chains. We find that a naive plug-in, sampling-based approach applied to these inventory MDPs yields strictly lower sample complexity bounds than the optimal bounds recently obtained for general MDPs. More specifically, in the infinite-horizon discounted cost setting, we obtain a sample complexity bound of $\tilde{O}\left(\min\left\{\frac{(\bar{S}-\underline{s})^2}{(1-\gamma)^2\varepsilon^2},\ \frac{1}{(1-\gamma)^4\varepsilon^2}\right\}\right)$, where $(\bar{S}-\underline{s})^2$ corresponds to the number of state-action pairs $|\mathcal{S}||\mathcal{A}|$ in a generic MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$. As such, $\tilde{O}\left(\frac{(\bar{S}-\underline{s})^2}{(1-\gamma)^2\varepsilon^2}\right)$ improves on the optimal generic RL bound $\tilde{\Theta}\left(\frac{(\bar{S}-\underline{s})^2}{(1-\gamma)^3\varepsilon^2}\right)$ (obtained by directly applying $\tilde{\Theta}\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\varepsilon^2}\right)$ here) by a factor of $(1-\gamma)^{-1}$, while $\tilde{O}\left(\frac{1}{(1-\gamma)^4\varepsilon^2}\right)$ removes the dependence on the state and action cardinalities entirely. In the infinite-horizon average cost setting, we obtain an $\tilde{O}\left(\frac{(\bar{S}-\underline{s})^2}{\varepsilon^2}\right)$ bound, improving on the generic optimal RL bound $\tilde{\Theta}\left(\frac{(\bar{S}-\underline{s})^2\, t_{\mathrm{mix}}}{\varepsilon^2}\right)$ (obtained by directly applying $\tilde{\Theta}\left(\frac{|\mathcal{S}||\mathcal{A}|\, t_{\mathrm{mix}}}{\varepsilon^2}\right)$ here) by a factor of $t_{\mathrm{mix}}$, hence removing the mixing-time dependence. By carefully leveraging the structural properties of the inventory dynamics in the various settings, we are able to improve on the "best-possible" bounds developed in the reinforcement learning (RL) literature. Our results demonstrate the drawbacks one could face by blindly following generic RL algorithms and the necessity of designing sample-efficient algorithms that properly exploit the special structure of inventory systems.
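To make the "plug-in" idea from the abstract concrete, the sketch below illustrates one common reading of a model-based plug-in approach under a generative model: estimate the demand distribution from samples, then solve the resulting empirical discounted-cost inventory MDP exactly. All specifics here (Poisson demand, the cost parameters K, c, h, b, the names sample_demand and plug_in_policy, and the order-up-to action space on [s_low, S_bar]) are illustrative assumptions for this example, not the authors' algorithm or code.

import numpy as np

# NOTE: illustrative sketch only; demand model, costs, and action space are
# assumptions made for this example, not taken from the paper.

rng = np.random.default_rng(0)

def sample_demand(n):
    # Generative model: n i.i.d. demand draws (Poisson assumed for illustration).
    return rng.poisson(lam=3.0, size=n)

def plug_in_policy(n_samples, s_low=0, S_bar=20, K=5.0, c=1.0, h=0.5, b=2.0,
                   gamma=0.95, tol=1e-6):
    """Plug-in (model-based) approach: estimate the demand pmf from samples,
    then solve the empirical discounted-cost MDP by value iteration.
    States are inventory levels in [s_low, S_bar]; actions are order-up-to levels."""
    demands = sample_demand(n_samples)
    p_hat = np.bincount(demands) / n_samples            # empirical demand pmf

    states = np.arange(s_low, S_bar + 1)
    V = np.zeros(len(states))
    while True:
        Q = np.full((len(states), len(states)), np.inf)
        for i, x in enumerate(states):
            for j, y in enumerate(states[i:], start=i):  # order up to level y >= x
                order_cost = K + c * (y - x) if y > x else 0.0
                exp_stage, exp_next = 0.0, 0.0
                for d, p in enumerate(p_hat):
                    if p == 0.0:
                        continue
                    # holding cost on leftover stock, shortage penalty on unmet demand
                    exp_stage += p * (h * max(y - d, 0) + b * max(d - y, 0))
                    exp_next += p * V[max(y - d, s_low) - s_low]
                Q[i, j] = order_cost + exp_stage + gamma * exp_next
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            # greedy order-up-to level for each starting inventory level
            return states, states[Q.argmin(axis=1)]
        V = V_new

# Example usage (hypothetical): states, order_up_to = plug_in_policy(n_samples=10_000)

With a fixed ordering cost, the greedy policy of the solved empirical MDP is expected to take an (s, S) form; the question the paper's sample complexity analysis addresses is how many samples such a plug-in scheme needs before the resulting policy is $\varepsilon$-optimal.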