Kybernetika 48 no. 3, 429-452, 2012

Tree compression pushdown automaton

Jan Janoušek, Bořivoj Melichar and Martin Poliak

Abstract:

A new kind of a deterministic pushdown automaton, called a \emph{Tree Compression Automaton}, is presented. The tree compression automaton represents a complete compressed index of a set of trees for subtrees and accepts all subtrees of given trees. The algorithm for constructing our pushdown automaton is incremental. For a single tree with $n$ nodes, the automaton has at most $n+1$ states, its transition function cardinality is at most $4n$ and there are $2n+1$ pushdown store symbols. If hashing is used for storing automaton's transitions, thus removing a factor of $log n$, the construction of the automaton takes linear time and space with respect to the length $n$ of the input tree(s). Our pushdown automaton construction can also be used for finding all subtree repeats without augmenting the overall complexity.

Keywords:

pushdown automata, trees, indexing trees, arbology, compression

Classification:

05C05, 68Q68

References:

  1. : Arbology www pages: Available on \url{http://www.arbology.org/}, March 2012.   CrossRef
  2. A. V. Aho and J. D. Ullman: The Theory of Parsing, Translation, and Compiling. Prentice-Hall Englewood Cliffs, N.J. 1972.   CrossRef
  3. H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov and P. E. Bourne: The protein data bank. Nucleic Acids Res. 28 (2000), 235-242.   CrossRef
  4. P. Bille: Pattern Matching in Trees and Strings. Ph.D. Thesis, FIT University of Copenhagen 2008.   CrossRef
  5. G. Busatto, M. Lohrey and S. Maneth: Grammar-based Tree Compression. Technical Report 2004.   CrossRef
  6. L. Cleophas: Tree Algorithms. Two Taxonomies and a Toolkit. Ph.D. Thesis, Technische Universiteit Eindhoven 2008.   CrossRef
  7. R. Cole, R. Hariharan and P. Indyk: Tree pattern matching and subset matching in deterministic $log^3$-time. SODA (1999), 245-254.   CrossRef
  8. H. Comon, M. Dauchet, R. Gilleron, C. Löding, F. Jacquemard, D. Lugiez, S. Tison and M. Tommasi: Tree automata techniques and applications. Available on: \url{http://www.grappa.univ-lille3.fr/tata} (2007).   CrossRef
  9. F. Gecseg and M. Steinby: Tree languages. In: Handbook of Formal Languages (G. Rozenberg and A. Salomaa, eds.), Vol. 3 Beyond Words. Springer-Verlag, Berlin 1997, pp. 1-68.   CrossRef
  10. T. Flouri, J. Janoušek and B. Melichar: Subtree and tree pattern pushdown automata for trees in prefix notation (2009). Comput. Sci. Inform. Systems 7 (2010), 2, 331-357.   CrossRef
  11. Ch. Hoffmann and M. O'Donnell: Pattern matching in trees. J. Assoc. Comput. Mach. 29 (1982), 1, 68-95.   CrossRef
  12. J. E. Hopcroft, R. Motwani and J. D. Ullman: Introduction to Automata Theory, languages, and Computation. Second edition. Addison-Wesley, Boston 2001.   CrossRef
  13. C. H. Wu, R. Apweiler, A. Bairoch, D. A. Natale, W. C. Barker, B. Boeckmann, S. Ferro, E. Gasteiger, H. Huang and R. Lopez et al: The Universal Protein Resource (UniProt) - An Expanding Universe of Protein Information. Database issue, Nucleic Acids Research, Oxford University Press, D187-D191 2006.   CrossRef
  14. A. R. Schmidt, F. Waas, M. L. Kersten, M. J. Carey, I. Manolescu and R. Busse: XMark - A benchmark for XML data management. In: Proc. International Conference on Very Large Data Bases (VLDB), Hong Kong 2002, pp. 974-985.   CrossRef
  15. L. Van Der Beek, G. Bouma, R. Malouf, G. Van Noord and Rijksuniversiteit Groningen: The Alpino dependency treebank. In: Computational Linguistics in the Netherlands (CLIN) 2002, pp. 1686-1691.   CrossRef
  16. T. Welch: A technique for high-performance data compression. Computer 17 (1984), 6, 8-19.   CrossRef
  17. J. Ziv and A. Lempel: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23 (1977), 3, 337-343.   CrossRef