Kybernetika 55 no. 1, 81-113, 2019

A perturbation approach to approximate value iteration for average cost Markov decision processes with Borel spaces and bounded costs

Óscar Vega-Amaya and Joaquín López-Borbón

DOI: 10.14736/kyb-2019-1-0081

Abstract:

The present paper studies the \textit{approximate value iteration} (AVI) algorithm for the average cost criterion with bounded costs and Borel spaces. Convergence of the algorithm is established and a performance bound is provided under the assumption that the model satisfies a standard continuity-compactness condition and a uniform ergodicity condition. This is done for the class of approximation procedures that can be represented by positive linear operators which represent constant functions exactly and satisfy a certain continuity property. The main point is that these operators define transition probabilities on the state space of the controlled system. This has the following important consequences: (a) the approximating function is the average value of the target function with respect to the induced transition probability; (b) the approximation step in the AVI algorithm can be seen as a perturbation of the original Markov model; (c) the perturbed model inherits the ergodicity properties imposed on the original Markov model. These facts make it possible to bound the performance of the AVI algorithm in terms of the accuracy with which these operators approximate the primitive data of the model, namely, the one-step cost function and the system transition law. The bounds are given in terms of the supremum norm of bounded functions and the total variation norm of finite signed measures. The results are illustrated with numerical approximations for a class of single-item inventory systems with linear order cost, no set-up cost and no back-orders.
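
To illustrate the mechanism summarized above, the following is a minimal Python sketch of approximate relative value iteration for a single-item inventory model with linear order cost and no back-orders. It is not taken from the paper: all numerical data, the coarse grid and the choice of piecewise-linear interpolation as the approximation operator are assumptions made for illustration. The interpolation operator is linear, positive and reproduces constants, so, as in item (a) above, it averages the target function with respect to an induced transition probability on the state space.

```python
import numpy as np

# Illustrative single-item inventory model: linear order cost, no set-up
# cost, no back-orders.  All numerical values are assumptions for this sketch.
M = 20                                    # capacity: states are inventory levels 0, ..., M
p, h = 2.0, 1.0                           # unit ordering cost and unit holding cost
demand_pmf = np.array([0.25, 0.25, 0.25, 0.15, 0.10])   # demand distribution on {0, ..., 4}
states = np.arange(M + 1)

def bellman(v):
    """Exact dynamic-programming (Bellman) operator T of the average-cost model."""
    tv = np.empty_like(v)
    for x in states:
        costs = []
        for a in range(M - x + 1):        # order a units, never exceeding capacity
            nxt = np.maximum(x + a - np.arange(len(demand_pmf)), 0)   # no back-orders
            costs.append(p * a + h * x + demand_pmf @ v[nxt])
        tv[x] = min(costs)
    return tv

def approx_step(v, coarse):
    """Approximation operator L: piecewise-linear interpolation from a coarse grid.
    L is linear, positive and reproduces constants, so Lv(x) is an average of v
    with respect to a transition probability -- the property that allows the
    approximation step to be treated as a perturbation of the Markov model."""
    return np.interp(states, coarse, v[coarse])

coarse = np.arange(0, M + 1, 5)           # grid retained by the approximation step
v = np.zeros(M + 1, dtype=float)
for _ in range(500):
    tv = bellman(v)
    g_est = tv[0]                         # estimate of the optimal average cost (reference state 0)
    v_new = approx_step(tv - g_est, coarse)   # relative value iteration + approximation
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

print("estimated optimal average cost:", g_est)
```

Refining the grid makes the interpolation operator closer to the identity; the bounds mentioned in the abstract quantify how such approximation accuracy translates into the performance of the AVI algorithm.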

Keywords:

Markov decision processes, average cost criterion, approximate value iteration algorithm, contraction and non-expansive operators, perturbed Markov decision models

Classification:

90C40, 90C59, 93E20

References:

  1. J. Abounadi, D. Bertsekas and V. S. Borkar: Learning algorithms for Markov decision processes with average cost. SIAM J. Control Optim. 40 (2001), 681-698.   DOI:10.1137/s0363012999361974
  2. C. D. Aliprantis and K. C. Border: Infinite Dimensional Analysis. Third edition. Springer-Verlag, Berlin 2006.   CrossRef
  3. A. Almudevar: Approximate fixed point iteration with an application to infinite horizon Markov decision processes. SIAM J. Control Optim. 46 (2008), 541-561.   DOI:10.1137/040614384
  4. A. Arapostathis, V. S. Borkar, E. Fernández-Gaucherand, M. K. Ghosh and S. I. Marcus: Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim. 31 (1993), 282-344.   DOI:10.1137/0331018
  5. L. Beutel, H. Gonska and D. Kacsó: On variation-diminishing Schoenberg operators: new quantitative statements. In: Multivariate Approximation and Interpolations with Applications (M. Gasca, ed.), Monografías de la Academia de Ciencias de Zaragoza No. 20, 2002, pp. 9-58.   CrossRef
  6. D. P. Bertsekas: Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs NJ 1987.   CrossRef
  7. D. P. Bertsekas and J. N. Tsitsiklis: Neuro-Dynamic Programming. Athena Scientific, Belmont 1996.   CrossRef
  8. D. P. Bertsekas: Approximate policy iteration: a survey and some new methods. J. Control Theory Appl. 9 (2011), 310-335.   DOI:10.1007/s11768-011-1005-3
  9. H. S. Chang and S. I. Marcus: Approximate receding horizon approach for Markov decision processes: average reward case. J. Math. Anal. Appl. 286 (2003), 636-651.   DOI:10.1016/s0022-247x(03)00506-7
  10. H. S. Chang, J. Hu, M. C. Fu and S. I. Marcus: Simulation-Based Algorithms for Markov Decision Processes. Second edition. Springer-Verlag, London 2013.   DOI:10.1007/978-1-4471-5022-0
  11. W. L. Cooper, S. G. Henderson and M. E. Lewis: Convergence of simulation-based policy iteration. Prob. Eng. Inform. Sci. 17 (2003), 213-234.   DOI:10.1017/s0269964803172051
  12. R. A. DeVore: The Approximation of Continuous Functions by Positive Linear Operators. Lecture Notes in Mathematics 293. Springer-Verlag, Berlin, Heidelberg 1972.   DOI:10.1007/bfb0059493
  13. C. C. Y. Dorea and A. G. C. Pereira: A note on a variation of Doeblin's condition for uniform ergodicity of Markov chains. Acta Math. Hungar. 110 (2006), 287-292.   DOI:10.1007/s10474-006-0023-y
  14. F. Dufour and T. Prieto-Rumeau: Approximation of Markov decision processes with general state space. J. Math. Anal. Appl. 388 (2012), 1254-1267.   DOI:10.1016/j.jmaa.2011.11.015
  15. F. Dufour and T. Prieto-Rumeau: Stochastic approximations of constrained discounted Markov decision processes. J. Math. Anal. Appl. 413 (2014), 856-879.   DOI:10.1016/j.jmaa.2013.12.016
  16. F. Dufour and T. Prieto-Rumeau: Approximation of average cost Markov decision processes using empirical distributions and concentration inequalities. Stochastics 87 (2015), 273-307.   DOI:10.1080/17442508.2014.939979
  17. D. P. de Farias and B. Van Roy: On the existence of fixed points for approximate value iteration and temporal difference learning. J. Optim. Theory Appl. 105 (2000), 589-608.   DOI:10.1023/a:1004641123405
  18. D. P. de Farias and B. Van Roy: Approximate linear programming for average-cost dynamic programming. In: Advances in Neural Information Processing Systems 15 (S. Becker, S. Thrun and K. Obermayer, eds.), MIT Press, Cambridge MA 2002, pp. 1587-1594.   CrossRef
  19. D. P. de Farias and B. Van Roy: A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees. Math. Oper. Res. 31 (2006), 597-620.   DOI:10.1287/moor.1060.0208
  20. G. J. Gordon: Stable function approximation dynamic programming. In: Proc. Twelfth International Conference on Machine Learning (A. Prieditis and S. J. Russell, eds.), Tahoe City CA 1995, pp. 261-268.   DOI:10.1016/b978-1-55860-377-6.50040-2
  21. A. Gosavi: A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis. Machine Learning 55 (2004), 5-29.   DOI:10.1023/b:mach.0000019802.64038.6c
  22. O. Hernández-Lerma: Adaptive Markov Control Processes. Springer-Verlag, NY 1989.   DOI:10.1007/978-1-4419-8714-3
  23. O. Hernández-Lerma and J. B. Lasserre: Discrete-Time Markov Control Processes. Basic Optimality Criteria. Springer-Verlag, NY 1996.   DOI:10.1007/978-1-4612-0729-0
  24. O. Hernández-Lerma and J. B. Lasserre: Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, NY 1999.   DOI:10.1007/978-1-4612-0561-6
  25. O. Hernández-Lerma and J. B. Lasserre: Markov Chains and Invariant Probabilities. Birkhäuser Verlag, Basel 2003.   DOI:10.1007/978-3-0348-8024-4
  26. O. Hernández-Lerma, R. Montes-de-Oca and R. Cavazos-Cadena: Recurrence conditions for Markov decision processes with Borel spaces: a survey. Ann. Oper. Res. 29 (1991), 29-46.   DOI:10.1007/bf02055573
  27. O. Hernández-Lerma, O. Vega-Amaya and G. Carrasco: Sample-path optimality and variance-minimization of average cost Markov control processes. SIAM J. Control Optim. 38 (1999), 79-93.   DOI:10.1137/s0363012998340673
  28. A. Jaskiewicz and A. S. Nowak: On the optimality equation for average cost Markov control processes with Feller transition probabilities. J. Math. Anal. Appl. 316 (2006), 495-509.   DOI:10.1016/j.jmaa.2005.04.065
  29. E. Klein and A. C. Thompson: Theory of Correspondences. Wiley, New York 1984.   CrossRef
  30. V. R. Konda and J. N. Tsitsiklis: Actor-critic algorithms. SIAM J. Control Optim. 42 (2003), 1143-1166.   DOI:10.1137/s0363012901385691
  31. J. M. Lee and J. H. Lee: Approximate dynamic programming strategies and their applicability for process control: a review and future direction. Int. J. Control Automat. Systems 2 (2004), 263-278.   CrossRef
  32. S. P. Meyn and R. L. Tweedie: Markov Chains and Stochastic Stability. Springer-Verlag, London 1993.   DOI:10.1007/978-1-4471-3267-7
  33. R. Montes-de-Oca and E. Lemus-Rodríguez: An unbounded Berge's minimum theorem with applications to discounted Markov decision processes. Kybernetika 48 (2012), 268-286.   CrossRef
  34. R. Munos: Performance bounds in $L_{p}$-norm for approximate value iteration. SIAM J. Control Optim. 47 (2007), 2303-2347.   DOI:10.1137/s0363012904441520
  35. A. S. Nowak: A generalization of Ueno's inequality for n-step transition probabilities. Appl. Math. 25 (1998), 295-299.   DOI:10.4064/am-25-3-295-299
  36. R. Ortner: Pseudometrics for state aggregation in average reward Markov decision processes. In: Algorithmic Learning Theory LNAI 4754 (M. Hutter, R. A. Servedio and E. Takimoto, eds.), Springer, Berlin, Heidelberg 2007, pp. 373-387.   DOI:10.1007/978-3-540-75225-7_30
  37. M. L. Puterman: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, NY 1994.   DOI:10.1002/9780470316887
  38. W. B. Powell: Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley and Sons Inc., 2007.   DOI:10.1002/9780470182963
  39. W. B. Powell: What you should know about approximate dynamic programming. Naval Res. Logist. 56 (2009), 239-249.   DOI:10.1002/nav.20347
  40. W. B. Powell: Perspectives of approximate dynamic programming. Ann. Oper. Res. 241 (2012), 319-356.   DOI:10.1007/s10479-012-1077-6
  41. W. B. Powell and J. Ma: A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. J. Control Theory Appl. 9 (2011), 336-352.   DOI:10.1007/s11768-011-0313-y
  42. M. T. Robles-Alcaraz, O. Vega-Amaya and J. A. Minjárez-Sosa: Estimate and approximate policy iteration algorithm for discounted Markov decision models with bounded costs and Borel spaces. Risk Decision Anal. 6 (2017), 79-95.   DOI:10.3233/rda-160116
  43. J. Rust: Numerical dynamic programming in economics. In: Handbook of Computational Economics, Vol. 13 (H. Amman, D. Kendrick and J. Rust, eds.), North-Holland, Amsterdam 1996, pp. 619-728.   DOI:10.1016/s1574-0021(96)01016-7
  44. M. S. Santos: Analysis of a numerical dynamic programming algorithm applied to economic models. Econometrica 66 (1998), 409-426.   DOI:10.2307/2998564
  45. M. S. Santos and J. Rust: Convergence properties of policy iteration. SIAM J. Control Optim. 42 (2004), 2094-2115.   DOI:10.1137/s0363012902399824
  46. N. Saldi, S. Yuksel and T. Linder: Asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Math. Oper. Res. 42 (2017), 945-978.   DOI:10.1287/moor.2016.0832
  47. J. Stachurski: Continuous state dynamic programming via nonexpansive approximation. Comput. Economics 31 (2008), 141-160.   DOI:10.1007/s10614-007-9111-5
  48. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. MIT Press, Cambridge MA 1998.   DOI:10.1108/k.1998.27.9.1093.3
  49. B. Van Roy: Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31 (2006), 234-244.   DOI:10.1287/moor.1060.0188
  50. O. Vega-Amaya: The average cost optimality equation: a fixed-point approach. Bol. Soc. Mat. Mexicana 9 (2003), 185-195.   CrossRef
  51. O. Vega-Amaya: Zero-sum average semi-Markov games: fixed point solutions of the Shapley equation. SIAM J. Control Optim. 42 (2003), 1876-1894.   DOI:10.1137/s0363012902408423
  52. O. Vega-Amaya: Solutions of the average cost optimality equation for Markov decision processes with weakly continuous kernel: The fixed-point approach revisited. J. Math. Anal. Appl. 464 (2018), 152-163.   DOI:10.1016/j.jmaa.2018.03.077
  53. O. Vega-Amaya and J. López-Borbón: A Perturbation approach to a class of discounted approximate value iteration algorithms with Borel spaces. J. Dyn. Games 3 (2016), 261-278.   DOI:10.3934/jdg.2016014
  54. O. Vega-Amaya and R. Montes-de-Oca: Application of average dynamic programming to inventory systems. Math. Methods Oper. Res. 47 (1998), 451-471.   DOI:10.1007/bf01198405