Kybernetika 57 no. 6, 879-907, 2021

On the Jensen-Shannon divergence and the variation distance for categorical probability distributions

Jukka Corander, Ulpu Remes and Timo Koski

DOI: 10.14736/kyb-2021-6-0879


We establish a decomposition of the Jensen-Shannon divergence into a linear combination of a scaled Jeffreys' divergence and a reversed Jensen-Shannon divergence. Upper and lower bounds for the Jensen-Shannon divergence are then found in terms of the squared (total) variation distance. The derivations rely upon the Pinsker inequality and the reverse Pinsker inequality. We use these bounds to prove the asymptotic equivalence of the maximum likelihood estimate and minimum Jensen-Shannon divergence estimate as well as the asymptotic consistency of the minimum Jensen-Shannon divergence estimate. These are key properties for likelihood-free simulator-based inference.
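The quantities named in the abstract can be checked numerically on a small example. The sketch below is an illustration only and is not taken from the paper. It assumes natural logarithms, strictly positive categorical distributions, and the variation distance normalised as half the L1 norm. The decomposition it tests, JS = J/4 - rJS, follows from elementary algebra on the definitions and is one identity consistent with the abstract's description. The two bounds it tests are the standard Pinsker-based lower bound and the well-known upper bound ln(2) times the variation distance, which are not necessarily the bounds derived in the paper. All function names and the example vectors are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mixture(p, q):
    """Equal-weight mixture M = (P + Q) / 2."""
    return [(pi + qi) / 2 for pi, qi in zip(p, q)]


def jensen_shannon(p, q):
    """JS(P, Q) = KL(P || M) / 2 + KL(Q || M) / 2."""
    m = mixture(p, q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reversed_jensen_shannon(p, q):
    """Reversed form: KL(M || P) / 2 + KL(M || Q) / 2 (needs p, q > 0)."""
    m = mixture(p, q)
    return 0.5 * kl(m, p) + 0.5 * kl(m, q)


def jeffreys(p, q):
    """Jeffreys' divergence J(P, Q) = KL(P || Q) + KL(Q || P)."""
    return kl(p, q) + kl(q, p)


def total_variation(p, q):
    """Variation distance, here normalised as half the L1 norm."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical example distributions on three categories.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

js = jensen_shannon(p, q)
tv = total_variation(p, q)

# Decomposition check: JS = (1/4) * Jeffreys - reversed JS.
decomposed = 0.25 * jeffreys(p, q) - reversed_jensen_shannon(p, q)

print("JS:", js)
print("J/4 - rJS:", decomposed)
print("lower bound tv^2 / 2:", tv ** 2 / 2)
print("upper bound ln(2) * tv:", math.log(2) * tv)
```

The lower bound comes from applying the Pinsker inequality to each of the two terms KL(P || M) and KL(Q || M), using the fact that the variation distance between P and M is half of that between P and Q.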


Keywords: blended divergences, Chan-Darwiche metric, likelihood-free inference, implicit maximum likelihood, reverse Pinsker inequality, simulator-based inference


Classification (MSC): 62B10, 62H05, 94A17

