Kybernetika 55 no. 1, 81-113, 2019

A perturbation approach to approximate value iteration for average cost Markov decision processes with Borel spaces and bounded costs

Óscar Vega-Amaya and Joaquín López-Borbón

DOI: 10.14736/kyb-2019-1-0081

Abstract:

The present paper studies the \textit{approximate value iteration} (AVI) algorithm for the average cost criterion with bounded costs and Borel spaces. Convergence of the algorithm is established and a performance bound is provided under the assumption that the model satisfies a standard continuity-compactness condition and a uniform ergodicity condition. This is done for the class of approximation procedures that can be represented by positive linear operators which represent constant functions exactly and satisfy a certain continuity property. The main point is that these operators define transition probabilities on the state space of the controlled system. This has the following important consequences: (a) the approximating function is the average value of the target function with respect to the induced transition probability; (b) the approximation step in the AVI algorithm can be seen as a perturbation of the original Markov model; (c) the perturbed model inherits the ergodicity properties imposed on the original Markov model. These facts make it possible to bound the performance of the AVI algorithm in terms of the accuracy with which these operators approximate the primitive data of the model, namely, the one-step cost function and the system transition law. The bounds are given in terms of the supremum norm of bounded functions and the total variation norm of finite signed measures. The results are illustrated with numerical approximations for a class of single-item inventory systems with linear order cost, no set-up cost and no back-orders.
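
To illustrate the mechanism summarized above, the following is a minimal Python sketch of approximate relative value iteration for a single-item inventory model with linear order cost and no back-orders. It is not taken from the paper: all numerical data, the coarse grid and the choice of piecewise-linear interpolation as the approximation operator are assumptions made for illustration. The interpolation operator is linear, positive and reproduces constants, so, as in item (a) above, it averages the target function with respect to an induced transition probability on the state space.

```python
import numpy as np

# Illustrative single-item inventory model: linear order cost, no set-up
# cost, no back-orders.  All numerical values are assumptions for this sketch.
M = 20                                    # capacity: states are inventory levels 0, ..., M
p, h = 2.0, 1.0                           # unit ordering cost and unit holding cost
demand_pmf = np.array([0.25, 0.25, 0.25, 0.15, 0.10])   # demand distribution on {0, ..., 4}
states = np.arange(M + 1)

def bellman(v):
    """Exact dynamic-programming (Bellman) operator T of the average-cost model."""
    tv = np.empty_like(v)
    for x in states:
        costs = []
        for a in range(M - x + 1):        # order a units, never exceeding capacity
            nxt = np.maximum(x + a - np.arange(len(demand_pmf)), 0)   # no back-orders
            costs.append(p * a + h * x + demand_pmf @ v[nxt])
        tv[x] = min(costs)
    return tv

def approx_step(v, coarse):
    """Approximation operator L: piecewise-linear interpolation from a coarse grid.
    L is linear, positive and reproduces constants, so Lv(x) is an average of v
    with respect to a transition probability -- the property that allows the
    approximation step to be treated as a perturbation of the Markov model."""
    return np.interp(states, coarse, v[coarse])

coarse = np.arange(0, M + 1, 5)           # grid retained by the approximation step
v = np.zeros(M + 1, dtype=float)
for _ in range(500):
    tv = bellman(v)
    g_est = tv[0]                         # estimate of the optimal average cost (reference state 0)
    v_new = approx_step(tv - g_est, coarse)   # relative value iteration + approximation
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

print("estimated optimal average cost:", g_est)
```

Refining the grid makes the interpolation operator closer to the identity; the bounds mentioned in the abstract quantify how such approximation accuracy translates into the performance of the AVI algorithm.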

Keywords:

Markov decision processes, average cost criterion, approximate value iteration algorithm, contraction and non-expansive operators, perturbed Markov decision models

Classification:

90C40, 90C59, 93E20

References:

  1. J. Abounadi, D. Bertsekas and V. S. Borkar: Learning algorithms for Markov decision processes with average cost. SIAM J. Control Optim. 40 (2001), 681-698.   DOI:10.1137/s0363012999361974
  2. C. D. Aliprantis and K. C. Border: Infinite Dimensional Analysis. Third edition. Springer-Verlag, Berlin 2006.   CrossRef
  3. A. Almudevar: Approximate fixed point iteration with an application to infinite horizon Markov decision processes. SIAM J. Control Optim. 46 (2008), 541-561.   DOI:10.1137/040614384
  4. A. Arapostathis, V. S. Borkar, E. Fernández-Gaucherand, M. K. Ghosh and S. I. Marcus: Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim. 31 (1993), 282-344.   DOI:10.1137/0331018
  5. L. Beutel, H. Gonska and D. Kacsó: On variation-diminishing Schoenberg operators: new quantitative statements. In: Multivariate Approximation and Interpolations with Applications (M. Gasca, ed.), Monografías de la Academia de Ciencias de Zaragoza No. 20, 2002, pp. 9-58.   CrossRef
  6. D. P. Bertsekas: Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs NJ 1987.   CrossRef
  7. D. P. Bertsekas and J. N. Tsitsiklis: Neuro-Dynamic Programming. Athena Scientific, Belmont 1996.   CrossRef
  8. D. P. Bertsekas: Approximate policy iteration: a survey and some new methods. J. Control Theory Appl. 9 (2011), 310-335.   DOI:10.1007/s11768-011-1005-3
  9. H. S. Chang and S. I. Marcus: Approximate receding horizon approach for Markov decision processes: average reward case. J. Math. Anal. Appl. 286 (2003), 636-651.   DOI:10.1016/s0022-247x(03)00506-7
  10. H. S. Chang, J. Hu, M. C. Fu and S. I. Marcus: Simulation-Based Algorithms for Markov Decision Processes. Second edition. Springer-Verlag, London 2013.   DOI:10.1007/978-1-4471-5022-0
  11. W. L. Cooper, S. G. Henderson and M. E. Lewis: Convergence of simulation-based policy iteration. Prob. Eng. Inform. Sci. 17 (2003), 213-234.   DOI:10.1017/s0269964803172051
  12. R. A. DeVore: The Approximation of Continuous Functions by Positive Linear Operators. Lecture Notes in Mathematics 293. Springer-Verlag, Berlin, Heidelberg 1972.   DOI:10.1007/bfb0059493
  13. C. C. Y. Dorea and A. G. C. Pereira: A note on a variation of Doeblin's condition for uniform ergodicity of Markov chains. Acta Math. Hungar. 110 (2006), 287-292.   DOI:10.1007/s10474-006-0023-y
  14. F. Dufour and T. Prieto-Rumeau: Approximation of Markov decision processes with general state space. J. Math. Anal. Appl. 388 (2012), 1254-1267.   DOI:10.1016/j.jmaa.2011.11.015
  15. F. Dufour and T. Prieto-Rumeau: Stochastic approximations of constrained discounted Markov decision processes. J. Math. Anal. Appl. 413 (2014), 856-879.   DOI:10.1016/j.jmaa.2013.12.016
  16. F. Dufour and T. Prieto-Rumeau: Approximation of average cost Markov decision processes using empirical distributions and concentration inequalities. Stochastics 87 (2015), 273-307.   DOI:10.1080/17442508.2014.939979
  17. D. P. de Farias and B. Van Roy: On the existence of fixed points for approximate value iteration and temporal difference learning. J. Optim. Theory Appl. 105 (2000), 589-608.   DOI:10.1023/a:1004641123405
  18. D. P. de Farias and B. Van Roy: Approximate linear programming for average-cost dynamic programming. In: Advances in Neural Information Processing Systems 15 (S. Becker, S. Thrun and K. Obermayer, eds.), MIT Press, Cambridge MA 2002, pp. 1587-1594.   CrossRef
  19. D. P. de Farias and B. Van Roy: A cost-shaping linear program for average-cost approximate dynamic programming with performance guarantees. Math. Oper. Res. 31 (2006), 597-620.   DOI:10.1287/moor.1060.0208
  20. G. J. Gordon: Stable function approximation dynamic programming. In: Proc. Twelfth International Conference on Machine Learning (A. Prieditis and S. J. Russell, eds.), Tahoe City CA 1995, pp. 261-268.   DOI:10.1016/b978-1-55860-377-6.50040-2
  21. A. Gosavi: A reinforcement learning algorithm based on policy iteration for average reward: empirical results with yield management and convergence analysis. Machine Learning 55 (2004), 5-29.   DOI:10.1023/b:mach.0000019802.64038.6c
  22. O. Hernández-Lerma: Adaptive Markov Control Processes. Springer-Verlag, NY 1989.   DOI:10.1007/978-1-4419-8714-3
  23. O. Hernández-Lerma and J. B. Lasserre: Discrete-Time Markov Control Processes. Basic Optimality Criteria. Springer-Verlag, NY 1996.   DOI:10.1007/978-1-4612-0729-0
  24. O. Hernández-Lerma and J. B. Lasserre: Further Topics on Discrete-Time Markov Control Processes. Springer-Verlag, NY 1999.   DOI:10.1007/978-1-4612-0561-6
  25. O. Hernández-Lerma and J. B. Lasserre: Markov Chains and Invariant Probabilities. Birkhäuser Verlag, Basel 2003.   DOI:10.1007/978-3-0348-8024-4
  26. O. Hernández-Lerma, R. Montes-de-Oca and R. Cavazos-Cadena: Recurrence conditions for Markov decision processes with Borel spaces: a survey. Ann. Oper. Res. 29 (1991), 29-46.   DOI:10.1007/bf02055573
  27. O. Hernández-Lerma, O. Vega-Amaya and G. Carrasco: Sample-path optimality and variance-minimization of average cost Markov control processes. SIAM J. Control Optim. 38 (1999), 79-93.   DOI:10.1137/s0363012998340673
  28. A. Jaskiewicz and A. S. Nowak: On the optimality equation for average cost Markov control processes with Feller transition probabilities. J. Math. Anal. Appl. 316 (2006), 495-509.   DOI:10.1016/j.jmaa.2005.04.065
  29. E. Klein and A. C. Thompson: Theory of Correspondences. Wiley, New York 1984.   CrossRef
  30. V. R. Konda and J. N. Tsitsiklis: Actor-critic algorithms. SIAM J. Control Optim. 42 (2003), 1143-1166.   DOI:10.1137/s0363012901385691
  31. J. M. Lee and J. H. Lee: Approximate dynamic programming strategies and their applicability for process control: a review and future direction. Int. J. Control Automat. Systems 2 (2004), 263-278.   CrossRef
  32. S. P. Meyn and R. L. Tweedie: Markov Chains and Stochastic Stability. Springer-Verlag, London 1993.   DOI:10.1007/978-1-4471-3267-7
  33. R. Montes-de-Oca and E. Lemus-Rodríguez: An unbounded Berge's minimum theorem with applications to discounted Markov decision processes. Kybernetika 48 (2012), 268-286.   CrossRef
  34. R. Munos: Performance bounds in $L_{p}$-norm for approximate value iteration. SIAM J. Control Optim. 47 (2007), 2303-2347.   DOI:10.1137/s0363012904441520
  35. A. S. Nowak: A generalization of Ueno's inequality for n-step transition probabilities. Appl. Math. 25 (1998), 295-299.   DOI:10.4064/am-25-3-295-299
  36. R. Ortner: Pseudometrics for state aggregation in average reward Markov decision processes. In: Algorithmic Learning Theory LNAI 4754 (M. Hutter, R. A. Servedio and E. Takimoto, eds.), Springer, Berlin, Heidelberg 2007, pp. 373-387.   DOI:10.1007/978-3-540-75225-7_30
  37. M. L. Puterman: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, NY 1994.   DOI:10.1002/9780470316887
  38. W. B. Powell: Approximate Dynamic Programming: Solving the Curses of Dimensionality. John Wiley and Sons Inc., 2007.   DOI:10.1002/9780470182963
  39. W. B. Powell: What you should know about approximate dynamic programming. Naval Res. Logist. 56 (2009), 239-249.   DOI:10.1002/nav.20347
  40. W. B. Powell: Perspectives of approximate dynamic programming. Ann. Oper. Res. 241 (2012), 319-356.   DOI:10.1007/s10479-012-1077-6
  41. W. B. Powell and J. Ma: A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. J. Control Theory Appl. 9 (2011), 336-352.   DOI:10.1007/s11768-011-0313-y
  42. M. T. Robles-Alcaraz, O. Vega-Amaya and J. A. Minjárez-Sosa: Estimate and approximate policy iteration algorithm for discounted Markov decision models with bounded costs and Borel spaces. Risk Decision Anal. 6 (2017), 79-95.   DOI:10.3233/rda-160116
  43. J. Rust: Numerical dynamic programming in economics. In: Handbook of Computational Economics, Vol. 13 (H. Amman, D. Kendrick and J. Rust, eds.), North-Holland, Amsterdam 1996, pp. 619-728.   DOI:10.1016/s1574-0021(96)01016-7
  44. M. S. Santos: Analysis of a numerical dynamic programming algorithm applied to economic models. Econometrica 66 (1998), 409-426.   DOI:10.2307/2998564
  45. M. S. Santos and J. Rust: Convergence properties of policy iteration. SIAM J. Control Optim. 42 (2004), 2094-2115.   DOI:10.1137/s0363012902399824
  46. N. Saldi, S. Yuksel and T. Linder: Asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Math. Oper. Res. 42 (2017), 945-978.   DOI:10.1287/moor.2016.0832
  47. J. Stachurski: Continuous state dynamic programming via nonexpansive approximation. Comput. Economics 31 (2008), 141-160.   DOI:10.1007/s10614-007-9111-5
  48. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction. MIT Press, Cambridge MA 1998.   DOI:10.1108/k.1998.27.9.1093.3
  49. B. Van Roy: Performance loss bounds for approximate value iteration with state aggregation. Math. Oper. Res. 31 (2006), 234-244.   DOI:10.1287/moor.1060.0188
  50. O. Vega-Amaya: The average cost optimality equation: a fixed-point approach. Bol. Soc. Mat. Mexicana 9 (2003), 185-195.   CrossRef
  51. O. Vega-Amaya: Zero-sum average semi-Markov games: fixed point solutions of the Shapley equation. SIAM J. Control Optim. 42 (2003), 1876-1894.   DOI:10.1137/s0363012902408423
  52. O. Vega-Amaya: Solutions of the average cost optimality equation for Markov decision processes with weakly continuous kernel: The fixed-point approach revisited. J. Math. Anal. Appl. 464 (2018), 152-163.   DOI:10.1016/j.jmaa.2018.03.077
  53. O. Vega-Amaya and J. López-Borbón: A Perturbation approach to a class of discounted approximate value iteration algorithms with Borel spaces. J. Dyn. Games 3 (2016), 261-278.   DOI:10.3934/jdg.2016014
  54. O. Vega-Amaya and R. Montes-de-Oca: Application of average dynamic programming to inventory systems. Math. Methods Oper. Res. 47 (1998), 451-471.   DOI:10.1007/bf01198405