Kybernetika 40 no. 3, 381-396, 2004

Multidimensional term indexing for efficient processing of complex queries

Michal Krátký, Tomáš Skopal and Václav Snášel

Abstract:

The area of \emph{Information Retrieval} (IR) deals with problems of storage and retrieval within huge collections of text documents. In IR models, the semantics of a document is usually characterized by a set of terms. A need common to various IR models is efficient term retrieval, provided via a term index. Existing approaches to term indexing, e.g. the inverted list, efficiently support only simple queries asking for a term occurrence. In practice, we would like to exploit more sophisticated querying mechanisms, in particular queries based on regular expressions. In this article we propose a multidimensional approach to term indexing that provides efficient term retrieval and supports regular expression queries. Since term lengths usually differ, we also introduce an improvement based on a new data structure, called \emph{BUB-forest}, which provides even more efficient term retrieval.
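
The limitation of the inverted list mentioned above can be illustrated with a minimal sketch. This is not the method proposed in the article, only a naive baseline written under stated assumptions: the function names (`build_inverted_index`, `exact_query`, `regex_query`) and the toy documents are hypothetical, and terms are obtained by simple whitespace splitting. A term-occurrence query is a single lookup, whereas a regular expression query has to test every term in the vocabulary.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def exact_query(index, term):
    """Simple term-occurrence query: one dictionary lookup."""
    return index.get(term, [])


def regex_query(index, pattern):
    """Regular expression query: scans the whole vocabulary (assumed baseline)."""
    rx = re.compile(pattern)
    hits = set()
    for term, ids in index.items():
        if rx.fullmatch(term):  # the pattern must match the entire term
            hits.update(ids)
    return sorted(hits)


# Hypothetical toy collection, used only for illustration.
docs = {
    1: "term index for text retrieval",
    2: "inverted list of terms",
    3: "tree index structures",
}
index = build_inverted_index(docs)

print(exact_query(index, "index"))
print(regex_query(index, r"te.*"))
```

In this sketch the cost of `regex_query` grows with the number of distinct terms, which is the kind of inefficiency that motivates an index supporting regular expression queries directly.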