Kybernetika 48 no. 3, 386-401, 2012

The finite automata approaches in stringology

Jan Holub

Abstract:

We present an overview of four approaches of the finite automata use in stringology: deterministic finite automaton, deterministic simulation of nondeterministic finite automaton, finite automaton as a model of computation, and compositions of finite automata solutions. We also show how the finite automata can process strings build over more complex alphabet than just single symbols (degenerate symbols, strings, variables).

Keywords:

exact pattern matching, approximate pattern matching, finite automata, dynamic programming, bitwise parallelism, suffix automaton, border array, degenerate symbol

Classification:

93E12, 62A10

References:

  1. K. Abrahamson: Generalized string matching. SIAM J. Comput. 16 (1987), 6, 1039-1051.   CrossRef
  2. A. V. Aho and M. J. Corasick: Efficient string matching: an aid to bibliographic search. Commun. ACM 18 (1975), 6, 333-340.   CrossRef
  3. C. Allauzen, M. Crochemore and M. Raffinot: Factor oracle: A new structure for pattern matching. In: SOFSEM'99, Theory and Practice of Informatics (J. Pavelka, G. Tel, and M. Bartošek, eds.), Milovy 1999, Lect. Notes Comput. Sci. 1725 (1999), Springer-Verlag, Berlin, pp. 295-310.   CrossRef
  4. P. Antoniou, J. Holub, C. S. Iliopoulos, B. Melichar and P. Peterlongo: Finding common motifs with gaps using finite automata. In: Implementation and Application of Automata (O. H. Ibarra and H.-C. Yen, eds.), Lect. Notes Comput. Sci. 4094 (2006), Springer-Verlag, Heidelberg, pp. 69-77.   CrossRef
  5. R. A. Baeza-Yates and G. H. Gonnet: A new approach to text searching. Commun. ACM 35 (1992), 10, 74-82.   CrossRef
  6. B. S. Baker: Parameterized duplication in strings: algorithms and an application to software maintenance. SIAM J. Comput. 26 (1997), 5, 1343-1362.   CrossRef
  7. A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler, M. T. Chen and J. Seiferas: The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40 (1985), 1, 31-44.   CrossRef
  8. A. Blumer, J. Blumer, A. Ehrenfeucht, D. Haussler and R. McConnel: Complete inverted files for efficient text retrieval and analysis. J. Assoc. Comput. Mach. 34 (1987), 3, 578-595.   CrossRef
  9. R. S. Boyer and J. S. Moore: A fast string searching algorithm. Commun. ACM 20 (1977), 10, 762-772.   CrossRef
  10. E. Cambouropoulos, M. Crochemore, C. S. Iliopoulos, L. Mouchard and Y. J. Pinzon: Algorithms for computing approximate repetitions in musical sequences. In: Proc. Tenth Australian Workshop on Combinatorial Algorithms (R. Raman and J. Simpson, eds.), AWOCA'99, School of Computing, Curtin University of Technology, Perth 1999, pp. 129-144.   CrossRef
  11. B. Commentz-Walter: A string matching algorithm fast on the average. In: Proc. 6th International Colloquium on Automata, Languages and Programming (H. A. Maurer, ed.), Graz 1979, Lect. Notes Comput. Sci. 71 (1979), Springer-Verlag, Berlin, pp. 118-132.   CrossRef
  12. M. Crochemore: Transducers and repetitions. Theor. Comput. Sci. 45 (1986), 1, 63-86.   CrossRef
  13. F. Damerau: A technique for computer detection and correction of spelling errors. Commun. ACM 7 (1964), 3, 171-176.   CrossRef
  14. B. D{ö}m{ö}lki: An algorithm for syntactical analysis. Comput. Linguistics 3 (1964), 29-46, 1964.   CrossRef
  15. M. J. Fischer and M. Paterson: String matching and other products. In: Proc. SIAM-AMS Complexity of Computation (R. M. Karp, ed.), Providence 1974, pp. 113-125.   CrossRef
  16. K. Fredriksson and M. Mozgovoy: Efficient parameterized string matching. Inf. Process. Lett. 100 (2006), 3, 91-96.   CrossRef
  17. Z. Galil: Open problems in stringology. In: Combinatorial Algorithms on Words (A. Apostolico and Z. Galil, eds.), NATO ASI Series F 12 (1985), Springer-Verlag, Berlin, pp. 1-8.   CrossRef
  18. R. W. Hamming: Error detecting and error correcting codes. The Bell System Techn. J. 29 (1950), 2, 147-160.   CrossRef
  19. J. Holub: Bit parallelism-NFA simulation. In: Implementation and Application of Automata (B. W. Watson and D. Wood, eds.), Lect. Notes in Comput. Sci. 2494 (2002), Springer-Verlag, Heidelberg, pp. 149-160.   CrossRef
  20. J. Holub: Finite automata in pattern matching. In: Algorithms in Computational Molecular Biology: Techniques, Approaches and Applications (M. Elloumi and A. Y. Zomaya, eds.), John Wiley and Sons, 2011, pp. 51-71.   CrossRef
  21. J. Holub and T. Kadlec: NFA simulation using deterministic state cache. In: London Algorithmics 2008: Theory and Practice (J. Chan, J. W. Daykin, and M. S. Rahman, eds.), College Publications 2009, pp. 152-166.   CrossRef
  22. J. Holub and B. Melichar: Implementation of nondeterministic finite automata for approximate pattern matching. In: Proc. 3rd International Workshop on Implementing Automata, Rouen 1998, pp. 74-81.   CrossRef
  23. J. Holub and B. Melichar: Approximate string matching using factor automata. Theor. Comput. Sci. 249 (2000), 2, 30-311.   CrossRef
  24. J. Holub and W. F. Smyth: Algorithms on indeterminate strings. In: Proc. 14th Australasian Workshop On Combinatorial Algorithms (M. Miller and K. Park, eds.), Seoul National University, Seoul 2003, pp. 36-45.   CrossRef
  25. J. Holub, W. F. Smyth and S. Wang: Hybrid pattern-matching algorithms on indeterminate strings. In: London Stringology Day and London Algorithmic Workshop 2006 (J. Daykin, M. Mohamed, and K. Steinhoefel, eds.) King's College London Series Texts in Algorithmics 2007, pp. 115-133.   CrossRef
  26. J. Holub, W. F. Smyth and S. Wang: Fast pattern-matching on indeterminate strings. J. Discret. Algorithms 6 (2008), 1, 37-50.   CrossRef
  27. J. Holub and P. {Š}piller: Practical experiments with NFA simulation. In: Proc. Eindhoven FASTAR Days 2004 (L. Cleophas and B. W. Watson, eds.), TU Eindhoven 2004, The Finite Automata Approaches in Stringology 15, pp. 73-95.   CrossRef
  28. J. E. Hopcroft and J. D. Ullman: Introduction to automata, languages and computations. Addison-Wesley, Reading 1979.   CrossRef
  29. D. A. Huffman: The synthesis of sequential switching circuits. J. Franklin Inst. 257 (1954), 161-190, 275-303.   CrossRef
  30. D. A. Huffman: A method for the construction of minimum-redundancy codes. Proc. Inst. Radio Engrs 40 (1952), 9, 1098-1101.   CrossRef
  31. C. S. Iliopoulos, I. Jayasekera, B. Melichar and J. Šupol: Weighted degenerated approximate pattern matching. In: Proc. 1st International Conference on Language and Automata Theory and Applications, Tarragona 2007.   CrossRef
  32. S. C. Kleene: Representation of events in nerve nets and finite automata. Automata Studies (1956), 3-42.   CrossRef
  33. D. E. Knuth, J. H. Morris, Jr and V. R. Pratt: Fast pattern matching in strings. SIAM J. Comput. 6 (1977), 2, 323-350.   CrossRef
  34. S. Kurtz: Reducing the space requirements of suffix trees. Softw. Pract. Exp. 29 (1999), 13, 1149-1171.   CrossRef
  35. J. Lahoda and B. Melichar: Pattern matching in text coded by finite translation automaton. In: Proc. 7th International Multiconference Information Society (M. Heričko et al., eds.), Institut Jožef Stefan, Ljubljana 2004, pp. 212-214.   CrossRef
  36. J. Lahoda and B. Melichar: General pattern matching on regular collage system. In: Proc. Prague Stringology Conference'05 (J. Holub and M. Šimánek, eds.), Czech Technical University in Prague 2005, pp. 153-162.   CrossRef
  37. V. I. Levenshtein: Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl. 6 (1966), 707-710.   CrossRef
  38. W. S. McCulloch and W. Pitts: A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5 (1943), 115-133.   CrossRef
  39. G. H. Mealy: A method for synthetizing sequential circuits. Bell System Techn. J. 34 (1955), 5, 1045-1079.   CrossRef
  40. B. Melichar: Repetitions in text and finite automata. In: Proc. Eindhoven FASTAR Days 2004 (L. Cleophas and B. W. Watson, eds.), TU Eindhoven 2004, pp. 1-46.   CrossRef
  41. B. Melichar and J. Holub: 6D classification of pattern matching problems. In: Proceedings of the Prague Stringology Club Workshop'97 (J. Holub, ed.), Czech Technical University in Prague, 1997, pp. 24-32. Collaborative Report DC-97-03.   CrossRef
  42. B. Melichar, J. Holub and T. Polcar: Text searching algorithms. Vol. I: Forward string matching. Textbook for course Text Searching Algorithms, 2005.   CrossRef
  43. E. F. Moore: Gedanken experiments on sequential machines. Automata Studies (1965), 129-153.   CrossRef
  44. J. H. Morris, Jr and V. R. Pratt: A Linear Pattern-Matching Algorithm. Report 40, University of California, Berkeley 1970.   CrossRef
  45. G. Navarro and M. Raffinot: A bit-parallel approach to suffix automata: Fast extended string matching. In: Proc. 9th Annual Symposium on Combinatorial Pattern Matching (M. Farach-Colton, ed.), Lect. Notes in Comput. Sci. 1448, Piscataway, NJ, 1998. Springer-Verlag, Berlin, pp. 14-33.   CrossRef
  46. M. O. Rabin and D. Scott: Finite automata and their decision problems. IBM J. Res. Dev. 3 (1959) 114-125.   CrossRef
  47. P. H. Sellers: The theory and computation of evolutionary distances: Pattern recognition. J. Algorithms 1 (1980), 4, 359-373.   CrossRef
  48. R. K. Shyamasundar: A Simple String Matching Algorithm. Technical Report, Tata Institute of Fundamental Research 1976.   CrossRef
  49. M. {Š}im{\accent23 u}nek and B. Melichar: Borders and finite automata. In: Implementation and Application of Automata (O. H. Ibarra and H.-C. Yen, eds.), Lecture Notes in Comput. Sci. 4094, Springer-Verlag, Heidelberg 2006, pp. 58-68.   CrossRef
  50. E. Ukkonen: Finding approximate patterns in strings. J. Algorithms 6 (1985), 1-3, 132-137.   CrossRef
  51. M. Voráček, L. Vagner, S. Rahman and C. S. Iliopoulos: The constrained longest common subsequence problem for degenerate strings. In: Implementation and Application of Automata (J. Holub and J. Žďárek, eds.), Lecture Notes in Comput. Sci. 4783, Springer-Verlag, Heidelberg 2007, pp. 309-311.   CrossRef
  52. S. Wu and U. Manber: Fast text searching allowing errors. Commun. ACM 35 (1992), 10, 83-91.   CrossRef
  53. J. Ziv and A. Lempel: Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24 (1978), 530-536.   CrossRef