String Matching

The problem is to find whether a pattern p of length m occurs within a text t of length n.

Simple solution: naive string matching. Try the pattern at each position in the text, comparing character by character, and slide it one position to the right after a mismatch:

t = AAAAAAAAAAAAAA
p = AAAAAB
     AAAAAB
      etc.
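The naive approach can be sketched as follows. This is a minimal illustration, not code from the slides: the function name `naive_search` and the comparison counter are assumptions added to make the worst-case cost visible.

```python
def naive_search(text, pattern):
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    # Try every alignment of the pattern against the text.
    for i in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if text[i + j] != pattern[j]:
                break  # mismatch: slide the pattern one position right
            j += 1
        if j == m:
            return i, comparisons
    return -1, comparisons


if __name__ == "__main__":
    # A text of 14 A's and the pattern AAAAAB: every alignment fails
    # only on the last character, so all m comparisons are spent each time.
    print(naive_search("A" * 14, "AAAAAB"))
```

With a 14-character text of A's and the pattern AAAAAB, each of the 9 alignments costs 6 comparisons, which is the m(n-m+1) worst case.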


PowerPoint Slideshow about 'String Matching' - kineks


To get a feel for the hashing idea, suppose that our text and pattern are sequences of bits.

For example,

p=010111

t=0010110101001010011

The parity of a binary string is determined by the number of ones it contains: if the count is odd, the parity is 1; if it is even, the parity is 0. Since our pattern is six bits long, let's compute the parity at each position in t, counting six bits ahead. Call this f[i], where f[i] is the parity of the substring t[i..i+5].

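Since the parity of our pattern is 0, we only need to check positions 2, 4, 6, 8, 10, and 11 in the text (counting positions from 0), because those are the positions where f[i] = 0. But how do we compute the parity of all substrings of length m in only O(n) time? Computing each of the n-m+1 substrings separately would already cost m(n-m+1) units of time. The key is that consecutive windows overlap in m-1 bits, so f[i+1] can be obtained from f[i] in constant time by removing the bit t[i] that leaves the window and adding the bit t[i+m] that enters it.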
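The parity filter for this example can be sketched as below. The function names `window_parities` and `parity_candidates` are hypothetical, and positions are counted from 0. Each new window parity is derived from the previous one with two XOR operations, so the whole table costs O(n) rather than m(n-m+1).

```python
def window_parities(bits, m):
    """Parity of every length-m window of a bit string, computed in O(n)."""
    n = len(bits)
    if m <= 0 or m > n:
        return []
    parity = 0
    for ch in bits[:m]:          # first window: O(m)
        parity ^= int(ch)
    f = [parity]
    for i in range(1, n - m + 1):
        # Drop the bit leaving the window, add the bit entering it: O(1).
        parity ^= int(bits[i - 1]) ^ int(bits[i + m - 1])
        f.append(parity)
    return f


def parity_candidates(text, pattern):
    """Positions whose window parity equals the pattern's parity."""
    target = sum(int(ch) for ch in pattern) % 2
    parities = window_parities(text, len(pattern))
    return [i for i, par in enumerate(parities) if par == target]


if __name__ == "__main__":
    print(parity_candidates("0010110101001010011", "010111"))
```

A matching parity only marks a position as a candidate; each candidate still has to be compared against the pattern bit by bit.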

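The first value f[0] takes O(m) time to compute and each later f[i] takes O(1), giving an expected running time of O(m+n), given a good hash function. The worst case is O(mn) if we get many hash collisions, since every candidate position must still be verified character by character. Of course, we could also skip the verification and accept a small probability of a false match, which makes the algorithm probabilistic.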
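One common way to realize this hash-based scheme is the Rabin-Karp method with a polynomial rolling hash in place of parity. The sketch below is an illustration under assumptions that do not come from the slides: the function name `rabin_karp`, the base 256, and the modulus 1_000_000_007 are arbitrary choices.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)   # weight of the window's leading character
    hp = ht = 0
    for k in range(m):             # hash the pattern and the first window: O(m)
        hp = (hp * base + ord(pattern[k])) % mod
        ht = (ht * base + ord(text[k])) % mod
    for i in range(n - m + 1):
        # Verify on a hash hit, so collisions never produce a false match.
        if hp == ht and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Roll the hash: remove text[i], append text[i + m], in O(1).
            ht = ((ht - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1
```

Dropping the `text[i:i + m] == pattern` check would give the purely probabilistic variant, which can report a false match when two different strings hash to the same value.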

It is possible in some cases to search a text of length n in fewer than n comparisons!

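Horspool’s algorithm is a relatively simple technique that achieves this distinction for many (but not all) input patterns. The idea is to compare the pattern against the text from right to left instead of left to right, and to use the text character aligned with the pattern's last position to decide how far to shift the pattern.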

1. The character c in T that is aligned with the last character of the pattern does not occur anywhere in P. In this case there is no use shifting over by one, since any shift smaller than m would eventually line up some character of P with this c, which cannot match. Consequently, we can shift the pattern all the way over by the entire length of the pattern (m).

4. The character c in T matches the last character of P, and we have matched some characters from right to left before hitting a mismatch, but c also occurs among the first m-1 characters of P. In this case, the shift should be like case 2: we align the c in T with the rightmost occurrence of c among the first m-1 characters of P.

We first precompute the shifts and store them in a table. The table will be indexed by all possible characters that can appear in a text. To compute the shift T(c) for some character c we use the formula:

T(c) = the pattern’s length m, if c is not among the first m-1 characters of P; otherwise, the distance from the rightmost occurrence of c among the first m-1 characters of P to the last character of P.
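A minimal sketch of the shift table and the search loop follows. The names `shift_table` and `horspool` are hypothetical. As a design assumption, the table is a dictionary that stores only the characters among the first m-1 characters of P, with every other character falling back to the default shift m, instead of an array indexed by the whole alphabet.

```python
def shift_table(pattern):
    """Map each character among the first m-1 pattern characters to its shift."""
    m = len(pattern)
    table = {}
    # Later (more rightward) occurrences overwrite earlier ones, so each
    # entry ends up as the distance from the rightmost occurrence among
    # the first m-1 characters to the last character of the pattern.
    for k in range(m - 1):
        table[pattern[k]] = m - 1 - k
    return table


def horspool(text, pattern):
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    table = shift_table(pattern)
    i = m - 1                      # text index aligned with the pattern's last character
    while i < n:
        k = 0                      # number of characters matched, right to left
        while k < m and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1
        # Shift by the entry for the text character under the pattern's
        # last position; characters absent from the table shift by m.
        i += table.get(text[i], m)
    return -1
```

For example, `shift_table("BARBER")` gives shifts of 2 for B, 4 for A, 3 for R, and 1 for E, and `horspool("JIM_SAW_ME_IN_A_BARBERSHOP", "BARBER")` returns 16.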