# A backward selection procedure for approximating a discrete probability distribution by decomposable models

Francesco M. Malvestuto

# Abstract:

Decomposable (probabilistic) models are log-linear models generated by acyclic hypergraphs, and a number of nice properties enjoyed by them are known. In many applications the following selection problem naturally arises: given a probability distribution $p$ over a finite set $V$ of $n$ discrete variables and a positive integer $k$, find a decomposable model with tree-width $k$ that best fits $p$. If $\mathcal{H}$ is the generating hypergraph of a decomposable model and $p_{\mathcal{H}}$ is the estimate of $p$ under the model, we can measure the closeness of $p_{\mathcal{H}}$ to $p$ by the information divergence $D(p: p_{\mathcal{H}})$, so that the problem above reads: given $p$ and $k$, find an acyclic, connected hypergraph ${\mathcal{H}}$ of tree-width $k$ such that $D(p: p_{\mathcal{H}})$ is minimum. It is well-known that this problem is $NP$-hard. However, for $k = 1$ it was solved by Chow and Liu in a very efficient way; thus, starting from an optimal Chow-Liu solution, a few forward-selection procedures have been proposed with the aim at finding a good' solution for an arbitrary $k$. We propose a backward-selection procedure which starts from the (trivial) optimal solution for $k=n-1$, and we show that, in a study case taken from literature, our procedure succeeds in finding an optimal solution for every $k$.

# Keywords:

decomposable model, acyclic hypergraph, backward selection, information divergence, $k$-hypertree

# Classification:

05C65, 62-09, 68R10, 68T05

# References:

1. R. Almond and A. Kong: Optimality Issues in Constructing a Markov Tree from Graphical Models. Research Report A-3, Dept. Statistics, Harvard University, 1991.   CrossRef
2. A. Altmüller and R. M. Haralick: Approximating high dimensional probability disributions. In: Proc. XVII Int. Conf. on Patter Recognitions, 2004.   CrossRef
3. A. Altmüller and R. M. Haralick: Practical aspects of efficient forward selection in decomposable graphical models. In: Proc. XVI IEEE Int. Conf. on Tools with Artificial Intelligence, 2004, pp. 710-715.   CrossRef
4. F. R. Bach and M. I. Jordan: Thin junction trees. Adv. in Neural Inform. Proces. Systems 14 (2002), 569-572.   CrossRef
5. J.-H. Badsberg and F. M. Malvestuto: An implementation of the iterative proportional fitting procedure by propagation trees. Comput. Statist. Data Anal. 37 (2001), 297-322.   CrossRef
6. C. Beeri, R. Fagin, D. Maier and M. Yannakakis: On the desirability of acyclic database schemes. J. ACM 30 (1983), 479-513.   CrossRef
7. L. W. Beineke and R. E. Pippert: The enumeration of labelled 2-trees. J. Combin. Theory 6 (1969), 200-205.   CrossRef
8. Y. Bishop, S. E. Fienberg and P. W. Holland: Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge 1975.   CrossRef
9. D. T. Brown: A note on approximations to discrete probability distributions. Inform. and Control 2 (1959), 386-392.   CrossRef
10. D. Chickering: Learning Bayesian networks is NP-complete. In: Learning from Data, Lecture Notes in Statist. 112 (1996), 121-130.   CrossRef
11. C. K. Chow and C. N. Liu: Approximating discrete probability distributions with dependence trees. IEEE Trans. Inform. Theory 14 (1968), 462-467.   CrossRef
12. T. M. Cover: Elements of Information Theory. John Wiley and Sons, 1991.   CrossRef
13. I. Csiszár and J. Körner: Information Theory. Academic Press, 1981.   CrossRef
14. P. Dagum and M. Luby: Approximating probabilistic inference in belief networks is NP-hard. Artificial Intelligence 60 (1993), 141-153.   CrossRef
15. S. Dasgupta: Learning polytrees. In: Proc. XV Conference on Uncertainty in Artificial Intelligence, 1999, pp. 134-141.   CrossRef
16. A. Deshpande, M. Garofalakis and M. I. Jordan: Efficient stepwise selection in decomposable models. In: Proc. XVII Conf. on Uncertainty in Artificial Intelligence, 2001, pp. 128-135.   CrossRef
17. G. Ding, R. F. Lax, J. Chen, P. P. Chen and B. D. Marx: Comparison of greedy strategies for learning Markov networks of treewidth $k$. In: Proc. Int. Conf. on Machine Learning: Models, Technologies and Applications, 2007, pp. 294-301.   CrossRef
18. T. Havránek: On model search methods. In: Proc. IX Symp. on Computational Statistics, 1990, pp. 101-108.   CrossRef
19. T. Havránek: Simple formal systems in data analysis. In: Proc. Conf. on Symbolic-Numeric Data Analysis and Learning, 1991, pp. 373-381.   CrossRef
20. F. V. Jensen and F. Jensen: Optimal junction trees. In: Proc. X Conf. on Uncertainty in Artificial Intelligence (R. L. de Mantaras and D. Poole, eds.), 1994, pp. 360-366.   CrossRef
21. D. Karger and N. Srebro: Learning Markov networks: maximum bounded tree-width graphs. In: Proc. XII ACM-SIAM Symp. on Discrete Mathematics, 2001, pp. 392-401.   CrossRef
22. T. Kloks: Tree-width. LNCS 842, Springer Verlag, Berlin 1994.   CrossRef
23. T. Kocka: New algorithm for learning decomposable models. Unpublished manuscript, 2000.   CrossRef
24. E. Kovács and T. Szántai: Vine copulas as a mean for the construction of high dimensional probability distribution associated to a Markov network. arXiv:1105.1697v1, 2011.   CrossRef
25. H. H. Ku and S. Kullback: Approximating discrete probability distributions. IEEE Trans. Inform. Theory 15 (1969), 444-447.   CrossRef
26. S. L. Lauritzen: Graphical Models. Clarendon Press, Oxford 1996.   CrossRef
27. P. M. Lewis II: Approximating probability distributions to reduce storage requirements. Inform. and Control 2 (1959), 214-225.   CrossRef
28. F. M. Malvestuto: Operations research in the design of statistical databases (in Italian). In: Proc. AIRO Meeting on Operations Research and Computer Science, 1986, pp. 117-130.   CrossRef
29. F. M. Malvestuto: Approximating discrete probability distributions with decomposable models. IEEE Trans. Systems, Man and Cybernetics 21 (1991), 1287-1294.   CrossRef
30. F. M. Malvestuto: An axiomatization of loglinear models with an application to the model search. In: Learning from Data, LNS 112 (1996), pp. 175-184.   CrossRef
31. F. M. Malvestuto: Designing a probabilistic database from a given set of full conditional independences. In: Proc. Workshop on Conditional Independence Structures and Graphical Models, 1999.   CrossRef
32. F. M. Malvestuto: A hypergraph-theoretic analysis of collapsibility and decomposability for extended log-linear models. Statist. Comput. 11 (2001), 155-169.   CrossRef
33. F. M. Malvestuto: From conditional independences to factorization constraints with discrete random variables. Ann. Math. and Artificial Intelligence 35 (2002), 253-285.   CrossRef
34. F. M. Malvestuto: Tree and local computations in a cross-entropy minimization problem with marginal constraints. Kybernetika 46 (2010), 621-654.   CrossRef
35. C. Meek: Finding a path is harder than finding a tree. J. Artificial Intelligence Res. 15 (2001), 383-389.   CrossRef
36. M. Mezzini and M. Moscarini: Simple algorithms for minimal triangulation of a graph and backward selection of a decomposable Markov network. Theoret. Comput. Sci. 411 (2010), 958-966.   CrossRef
37. K. Nunez, J. Chen, P. Chen, G. Ding, R. F. Lax and B. Marx: Empirical comparison of greedy strategies for learning Markov networks of treewidth $k$. In: Proc. VII Int. Conf. on Machine Learning and Applications, 2008, pp. 106-113.   CrossRef
38. J. D. Rose: On simple characterizations of $k$-trees. Discrete Math. 7 (1974), 317-322.   CrossRef
39. T. Szántai and E. Kovács: Hypergraphs as a mean of discovering the dependence structure of a discrete multivariate probability distribution. In: Proc. Conf. on Applied Mathematical Programming and Modelling, 2008; also in Ann. Oper. Res. 193 (2012), 71-90.   CrossRef
40. T. Szántai and E. Kovács: Discovering a junction tree behind a Markov network by a greedy algorithm. arXiv:1104.2762v3, 2011.   CrossRef
41. R. E. Tarjan and M. Yannakakis: Simple linear-time algorithms to test chordality of graphs, test acyclicity of hypergraphs, and selectively reduce hypergraphs. SIAM J. Comput. 13 (1984), 566-579.   CrossRef
42. N. Wermuth: Analogies between multiplicative models in contingency tables and covariance selection. Biometrics 32 (1976), 95-108.   CrossRef
43. N. Wermuth: Model search among multiplicative models. Biometrics 32 (1976), 253-256.   CrossRef
44. Y. Xiang, S. K. M. Wong and N. Cercone: A `microscopic'' study of minimum entropy search in learning decomposable Markov networks. Mach. Learning 26 (1997), 65-72.   CrossRef