Citations

...hen the text is compressed. Text compression [5] exploits the redundancies of the text to represent it using less space. There are many different compression schemes, among which the Ziv-Lempel family =-=[35, 36]-=- is one of the best in practice because of its good compression ratios combined with efficient compression and decompression times. The compressed matching problem consists of searching for a pattern on ...

...gular expression searching is quite old and has received continuous attention since the sixties. A particularly interesting case of text searching arises when the text is compressed. Text compression =-=[5]-=- exploits the redundancies of the text to represent it using less space. There are many different compression schemes, among which the Ziv-Lempel family [35, 36] is one of the best in practice because ...

...ing to notice that any solution for compressed regular expression searching implies a solution for compressed approximate string matching, as the latter can be expressed as the output of an automaton =-=[20]-=-. Consider the NFA for k = 2 differences shown in Figure 4. Every row denotes the number of differences seen (the first row zero, the second row one, etc.). Every column represents matching a pattern pre...

...is is the problem we solve in this paper: we present the first solution for compressed regular expression searching. The format we choose is the Ziv-Lempel family, focusing on the LZ78 and LZW variants =-=[36, 33]-=-. Given a text of length u compressed into length n, we are able to find the R occurrences of a regular expression of length m in O(2^m + mn + Rm log m) worst case time, needing O(2^m + mn) space. We also ...

...t-parallel simulation of an NFA, or as an implementation of a DFA (where the identifier of each deterministic state is the bit mask as a whole). This idea has been used several times, under Thompson's =-=[34]-=- and Glushkov's [27] constructions. By using different properties of the constructions, both manage to implement the transition function D using O(2^m) space (actually, the Thompson-based version [34]...

...mbination of active/inactive NFA states becomes a single DFA state. Given a regular expression E, there are several techniques to produce an NFA that recognizes L(E). The most classical is Thompson's =-=[30]-=-. Given an expression of length m, this method produces an NFA of at most 2m states and 4m edges. A less popular one is Glushkov's [9], which produces an NFA of exactly m+1 states but O(m^2) edges. T...

...[9], which produces an NFA of exactly m+1 states but O(m^2) edges. To fix ideas we will assume in this paper that we build NFAs using the version of Glushkov's algorithm popularized by Berry and Sethi =-=[6]-=-. The problem of searching for a regular expression E in a given text string T is that of finding all the text substrings that belong to L(E). These are called occurrences. For simplicity, we report the...

...gth of a string matching the regular expression and forms a trie with all the prefixes of that length of strings matching the regular expression. A multipattern search algorithm like Commentz-Walter =-=[7]-=- is run over those prefixes as a filter to detect text areas where a complete occurrence may start. Those areas are then verified with a classical algorithm. Another technique of this kind is used in Gnu ...

...orithm for exact searching is from 1994, by Amir, Benson and Farach [3], who search LZ78 compressed texts needing time and space O(m^2 + n). The only search technique for LZ77 is by Farach and Thorup =-=[8]-=-, a randomized algorithm to determine in time O(m + n log^2(u/n)) whether a pattern is present or not in the text. An extension of the first work [3] to multipattern searching was presented by Kida et ...

...ce an NFA that recognizes L(E). The most classical is Thompson's [30]. Given an expression of length m, this method produces an NFA of at most 2m states and 4m edges. A less popular one is Glushkov's =-=[9]-=-, which produces an NFA of exactly m+1 states but O(m^2) edges. To fix ideas we will assume in this paper that we build NFAs using the version of Glushkov's algorithm popularized by Berry and Sethi [6]...

...thod is also different: instead of a Boyer-Moore like algorithm, it is based on BNDM [26]. 3.2 Compressed Pattern Matching The compressed matching problem was first defined in the work of Amir and Benson =-=[2]-=- as the task of performing string matching in a compressed text without decompressing it. Given a text T, a corresponding compressed string Z = z_1 ... z_n, and a pattern P, the compressed matchi...

...], but they need the text to contain natural language and be large (say, 10 Mb or more). Moreover, they allow only searching for whole words and phrases. There are also other practical ad-hoc methods =-=[15]-=-, but the compression they obtain is poor. Moreover, in these compression formats n = Θ(u), so the speedups can only be measured in practical terms. The second line of research considers Ziv-Lempel co...

...re R is the number of matches (note that it could be that R = u > n). Two different approaches exist to search compressed text. The first one is rather practical. Efficient solutions based on Huffman coding =-=[10]-=- on words have been presented by Moura et al. [18], but they need the text to contain natural language and be large (say, 10 Mb or more). Moreover, they allow only searching for whole words and phrase...

...first experimental results in this area. They achieve O(m^2 + n) time and space, although this time m is the total length of all the patterns. New practical results were presented by Navarro and Raffinot =-=[25]-=-, who proposed a general scheme to search Ziv-Lempel compressed texts (simple and extended patterns) and specialized it for the particular cases of LZ77, LZ78 and a new variant proposed that was compe...

... 2 subtables of size 2^(m/2). We need to access two tables for a transition but need only the square root of the space. Some techniques have been proposed to obtain a tradeoff between NFAs and DFAs. In =-=[19]-=- a Four-Russians approach is presented that obtains O(mu / log u) worst-case time and extra space. The idea is to divide the syntax tree of the regular expression into "modules", which are subtrees of ...
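
The table-splitting idea in the excerpt above can be sketched as follows. This is only an illustration under a stated assumption: for one fixed input character, the set of states reached from a set of active states is the union (bitwise OR) of the sets reached from each active state, so a table indexed by the whole bit mask can be replaced by two tables indexed by each half of the mask. The names `follow`, `build_half_tables`, `step` and `step_direct` are hypothetical and are not taken from the cited works.

```python
def build_half_tables(follow):
    """Build two transition tables, one per half of the state bits.

    follow[i] is the bit mask of states reachable from state i for one
    fixed input character (an assumed input format). Each table has
    about 2^(m/2) entries instead of 2^m for a single full table.
    """
    m = len(follow)
    half = m // 2

    def table(offset, bits):
        t = [0] * (1 << bits)
        for mask in range(1, 1 << bits):
            lowest = mask & -mask          # isolate the lowest set bit
            i = lowest.bit_length() - 1    # index of that bit in this half
            # Reuse the entry for the mask without its lowest bit.
            t[mask] = t[mask ^ lowest] | follow[offset + i]
        return t

    return table(0, half), table(half, m - half), half


def step(state, tables):
    """One transition: two table lookups combined with OR."""
    low, high, half = tables
    return low[state & ((1 << half) - 1)] | high[state >> half]


def step_direct(state, follow):
    """Reference version: OR the follow sets of all active states."""
    out = 0
    for i, f in enumerate(follow):
        if (state >> i) & 1:
            out |= f
    return out
```

For an automaton with m = 4 states, each half table has 4 entries, and `step` agrees with `step_direct` on all 16 possible state masks. The cost is one extra lookup and one OR per transition.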

...randomized algorithm to determine in time O(m + n log^2(u/n)) whether a pattern is present or not in the text. An extension of the first work [3] to multipattern searching was presented by Kida et al. =-=[13]-=-, together with the first experimental results in this area. They achieve O(m^2 + n) time and space, although this time m is the total length of all the patterns. New practical results were presented by...

...lt, restricted to the LZW format, was independently found and presented by Kida et al. [14]. The same group generalized the existing algorithms and nicely unified the concepts in a general framework =-=[12]-=-. Recently, Navarro and Tarhio [28] presented a new, faster, algorithm based on Boyer-Moore. Approximate string matching on compressed text aims at finding the pattern where a limited number of differenc...

...f LZ77, LZ78 and a new variant proposed that was competitive and convenient for search purposes. A similar result, restricted to the LZW format, was independently found and presented by Kida et al. =-=[14]-=-. The same group generalized the existing algorithms and nicely unified the concepts in a general framework [12]. Recently, Navarro and Tarhio [28] presented a new, faster, algorithm based on Boyer-Moo...

...Lempel compressed texts is much more complex, since the pattern can appear in different forms across the compressed text. The first algorithm for exact searching is from 1994, by Amir, Benson and Farach =-=[3]-=-, who search LZ78 compressed texts needing time and space O(m^2 + n). The only search technique for LZ77 is by Farach and Thorup [8], a randomized algorithm to determine in time O(m + n log^2(u/n)) w...

...s are searched for and the areas where they appear are checked for complete occurrences using a lazy deterministic automaton (i.e., built on the fly). The most recent development, also in this line, is =-=[24]-=-. They invert the arrows of the DFA and make all states initial and the initial state final. The result is an automaton that recognizes all the reverse prefixes of strings matching the regular expression...

...ds [18], but the solution is limited to search for a whole word and retrieve whole words that are similar. The first true solutions appeared very recently, by Kärkkäinen et al. [11], Matsumoto et al. =-=[16]-=- and Navarro et al. [23]. 4 A Search Algorithm We now present our approach to regular expression searching in a text Z = b_1 ... b_n, which is expressed by the LZ78 algorithm as a sequence of n block...
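
As background for the excerpt above, the following is a minimal sketch of how standard LZ78 parsing splits a text into blocks, each consisting of a reference to an earlier block plus one new character. It shows the general LZ78 scheme only, not the search algorithm of the cited paper; the function names `lz78_parse` and `lz78_expand` and the `(reference, char)` block representation are assumptions made for this example.

```python
def lz78_parse(text):
    """Split text into LZ78 blocks, returned as (reference, char) pairs.

    Reference 0 denotes the empty block; reference r > 0 denotes the
    r-th block produced so far (1-based).
    """
    trie = {}      # (parent block id, char) -> block id
    blocks = []
    node = 0       # block id of the longest match found so far
    for ch in text:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt                      # keep extending the match
        else:
            blocks.append((node, ch))       # new block = old block + ch
            trie[(node, ch)] = len(blocks)
            node = 0
    if node:
        # Text ended in the middle of a match: emit it with no new char.
        blocks.append((node, ""))
    return blocks


def lz78_expand(blocks):
    """Rebuild the original text from the (reference, char) blocks."""
    strings = [""]
    for ref, ch in blocks:
        strings.append(strings[ref] + ch)
    return "".join(strings[1:])
```

For example, `lz78_parse("abababa")` yields the four blocks `(0, 'a')`, `(0, 'b')`, `(1, 'b')`, `(3, 'a')`, which stand for the strings "a", "b", "ab" and "aba".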

...n of an NFA, or as an implementation of a DFA (where the identifier of each deterministic state is the bit mask as a whole). This idea has been used several times, under Thompson's [34] and Glushkov's =-=[27]-=- constructions. By using different properties of the constructions, both manage to implement the transition function D using O(2^m) space (actually, the Thompson-based version [34] may need O(2^(2m)) s...

...ching the regular expression. The idea is in this sense similar to that of [32], but takes less space. The search method is also different: instead of a Boyer-Moore like algorithm, it is based on BNDM =-=[26]-=-. 3.2 Compressed Pattern Matching The compressed matching problem was first defined in the work of Amir and Benson [2] as the task of performing string matching in a compressed text without decompressing...

...be that R = u > n). Two different approaches exist to search compressed text. The first one is rather practical. Efficient solutions based on Huffman coding [10] on words have been presented by Moura et al. =-=[18]-=-, but they need the text to contain natural language and be large (say, 10 Mb or more). Moreover, they allow only searching for whole words and phrases. There are also other practical ad-hoc methods [...

...at a good implementation of the automaton, but they must inspect all the text characters. Other proposals try to skip some text characters, as is usual for simple pattern matching. For example, in =-=[32]-=- they present an algorithm that determines the minimum length of a string matching the regular expression and forms a trie with all the prefixes of that length of strings matching the regular express...

...ithms. A first one, DFA, uses a bit-parallel DFA to process the text [27]. This is interesting because it is the algorithm we are modifying to work on compressed text. A second one, the software nrgrep =-=[21]-=-, uses a character skipping technique for searching [24, 27], which is much faster. In any case, the time to decompress is an order of magnitude higher than that to search the uncompressed text, so th...

...gorithms on uncompressed text, showing that we can search the compressed text twice as fast as the naive approach of decompressing and then searching. A preliminary version of this paper appeared in =-=[22]-=-. 2 Basic Concepts 2.1 Strings, Regular Expressions and Automata We give a very basic introduction to the subject. For more details see, for example, [1]. Given an alphabet (finite set of symbols) of...