
STRING SEARCHING
This leads to the very simple pattern-matching algorithm implemented
below. The program assumes the same index function as above, but d=32 is
used for efficiency (the multiplications might be implemented as shifts).
function rksearch: integer;
const q=33554393; d=32;
var h1, h2, dM, i: integer;
begin
dM:=1; for i:=1 to M-1 do dM:=(d*dM) mod q;
h1:=0; for i:=1 to M do h1:=(h1*d+index(p[i])) mod q;
h2:=0; for i:=1 to M do h2:=(h2*d+index(a[i])) mod q;
i:=1;
while (h1<>h2) and (i<=N-M) do
  begin
  h2:=(h2+d*q-index(a[i])*dM) mod q;
  h2:=(h2*d+index(a[i+M])) mod q;
  i:=i+1
  end;
rksearch:=i
end;
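The same rolling-hash scheme can be sketched in Python for readers who want to run it. This is an illustration, not the book's code: the function name is invented, and ord plays the role of the index function above.

```python
def rksearch(p, a):
    """Rabin-Karp: return the first 0-based index where pattern p
    occurs in text a (by hash match, as in the program above),
    or -1 if no window matches."""
    q = 33554393           # a large prime, as in the Pascal program
    d = 32                 # the radix; multiplying by d is a shift
    M, N = len(p), len(a)
    if M > N:
        return -1
    dM = pow(d, M - 1, q)  # d^(M-1) mod q, to remove the leading character
    h1 = h2 = 0
    for i in range(M):     # hash the pattern and the first M text characters
        h1 = (h1 * d + ord(p[i])) % q
        h2 = (h2 * d + ord(a[i])) % q
    i = 0
    while h1 != h2:
        if i + M >= N:
            return -1      # no more windows to try
        # roll the hash: drop a[i], append a[i+M]
        h2 = (h2 + d * q - ord(a[i]) * dM) % q
        h2 = (h2 * d + ord(a[i + M])) % q
        i += 1
    return i
```

As in the Pascal program, a match is reported on equal hash values; a careful implementation would verify the characters to guard against the (rare) collision.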

CHAPTER 19
Multiple Searches
The algorithms that we’ve been discussing are all oriented towards a specific
string searching problem: find an occurrence of a given pattern in a given
text string. If the same text string is to be the object of many pattern
searches, then it will be worthwhile to do some processing on the string to
make subsequent searches efficient.
If there are a large number of searches, the string searching problem can
be viewed as a special case of the general searching problem that we studied
in the previous section. We simply treat the text string as N overlapping
“keys,” the ith key defined to be a[i..N], the entire text string starting at
position i. Of course, we don’t manipulate the keys themselves, but pointers
to them: when we need to compare keys i and j we do character-by-character
compares starting at positions i and j of the text string. (If we use a “sentinel”
character larger than all other characters at the end, then one of the keys
will always be greater than the other.) Then the hashing, binary tree, and
other algorithms of the previous section can be used directly. First, an entire
structure is built up from the text string, and then efficient searches can be
performed for particular patterns.
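As a rough sketch of this approach (not the book's code), one can sort the N starting positions by the suffixes they point to and then binary-search for any pattern. The function names here are invented for the illustration, and for brevity the sort materializes the suffixes rather than comparing characters in place as the text describes.

```python
def build_index(text):
    """Sort the N starting positions by the suffixes beginning there.
    (A practical version would compare characters in place instead of
    building each suffix string.)"""
    return sorted(range(len(text)), key=lambda i: text[i:])

def search(text, starts, pattern):
    """Return a starting position of pattern in text, or -1."""
    lo, hi = 0, len(starts)
    while lo < hi:                     # binary search on sorted suffixes
        mid = (lo + hi) // 2
        if text[starts[mid]:starts[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(starts) and text[starts[lo]:].startswith(pattern):
        return starts[lo]
    return -1
```

Building the index costs one up-front sort; each subsequent pattern search then takes only a logarithmic number of string comparisons.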
There are many details which need to be worked out in applying searching
algorithms to string searching in this way; our intent is to point this out as
a viable option for some string searching applications. Different methods will
be appropriate in different situations. For example, if the searches will always
be for patterns of the same length, a hash table constructed with a single scan
as in the Rabin-Karp method will yield constant search times on the average.
On the other hand, if the patterns are to be of varying length, then one of the
tree-based methods might be appropriate. (Patricia is especially adaptable to
such an application.)
Other variations in the problem can make it significantly more difficult
and lead to drastically different methods, as we’ll discover in the next two
chapters.

Exercises
1. Implement a brute-force pattern matching algorithm that scans the pat-
tern from right to left.
2. Give the next table for the Knuth-Morris-Pratt algorithm for the pattern
AAIWUAA.
3. Give the next table for the Knuth-Morris-Pratt algorithm for the pattern
ABRACADABRA.
4. Draw a finite state machine which can search for the pattern
ABRACADABRA.
5. How would you search a text file for a string of 50 consecutive blanks?
6. Give the right-to-left skip table for the right-left scan for the pattern
ABRACADABRA.
7. Construct an example for which the right-to-left pattern scan with only
the mismatch heuristic performs badly.
8. How would you modify the Rabin-Karp algorithm to search for a given
pattern with the additional proviso that the middle character is a “wild
card” (any text character at all can match it)?
9. Implement a version of the Rabin-Karp algorithm that can find a given
two-dimensional pattern in a given two-dimensional text. Assume both
pattern and text are rectangles of characters.
10. Write programs to generate a random 1000-bit text string, then find all
occurrences of the last k bits elsewhere in the string, for k = 5,10,15.
(Different methods might be appropriate for different values of k.)

20. Pattern Matching
It is often desirable to do string searching with somewhat less than
complete information about the pattern to be found. For example, the
user of a text editor may wish to specify only part of his pattern, or he may
wish to specify a pattern which could match a few different words, or he might
wish to specify that any number of occurrences of some specific characters
should be ignored. In this chapter we’ll consider how pattern matching of this
type can be done efficiently.
The algorithms in the previous chapter have a rather fundamental depen-
dence on complete specification of the pattern, so we have to consider different
methods. The basic mechanisms that we will consider make possible a very
powerful string searching facility which can match complicated M-character
patterns in N-character text strings in time proportional to MN.
First, we have to develop a way to describe the patterns: a “language”
that can be used to specify, in a rigorous way, the kinds of partial string
searching problems suggested above. This language will involve more powerful
primitive operations than the simple “check if the ith character of the text
string matches the jth character of the pattern” operation used in the previous
chapter. In this chapter, we consider three basic operations in terms of an
imaginary type of machine that has the capability for searching for patterns
in a text string. Our pattern-matching algorithm will be a way to simulate
the operation of this type of machine. In the next chapter, we’ll see how to
translate from the pattern specification (which the user employs to describe
his string searching task) to the machine specification (which the algorithm
employs to actually carry out the search).
As we’ll see, this solution to the pattern matching problem is intimately
related to fundamental processes in computer science. For example, the
method that we will use in our program to perform the string searching task
implied by a given pattern description is akin to the method used by the

Pascal system to perform the computational task implied by a given Pascal
program.
Describing Patterns
We’ll consider pattern descriptions made up of symbols tied together with the
following three fundamental operations.
(i) Concatenation. This is the operation used in the last chapter. If two
characters are adjacent in the pattern, then there is a match if and only
if the same two characters are adjacent in the text. For example, AB
means A followed by B.
(ii) Or. This is the operation that allows us to specify alternatives in the
pattern. If we have an “or” between two characters, then there is a
match if and only if either of the characters occurs in the text. We’ll
denote this operation by using the symbol + and use parentheses to
allow it to be combined with concatenation in arbitrarily complicated
ways. For example, A+B means “either A or B”; C(AC+B)D means
“either CACD or CBD”; and (A+C)((B+C)D) means “either ABD or
CBD or ACD or CCD.”
(iii) Closure. This operation allows parts of the pattern to be repeated
arbitrarily. If we have the closure of a symbol, then there is a match if
and only if the symbol occurs any number of times (including 0). Closure
will be denoted by placing a * after the character or parenthesized group
to be repeated. For example, AB* matches strings consisting of an A
followed by any number of B’s, while (AB)* matches strings consisting
of alternating A’s and B’s.
A string of symbols built up using these three operations is called a regular
expression. Each regular expression describes many specific text patterns.
Our goal is to develop an algorithm which will determine if any of the patterns
described by a given regular expression occur in a given text string.
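Readers who want to experiment can try the three operations in any modern regular-expression tool, where the book's + is written | (a Python illustration, not part of the original text):

```python
import re

# Concatenation: AB means A followed by B.
assert re.search("AB", "XXABXX") is not None

# Or: the book's C(AC+B)D is written C(AC|B)D; it matches CACD or CBD.
assert re.fullmatch("C(AC|B)D", "CACD") is not None
assert re.fullmatch("C(AC|B)D", "CBD") is not None

# Closure: AB* matches an A followed by any number of B's (including none).
assert re.fullmatch("AB*", "A") is not None
assert re.fullmatch("AB*", "ABBB") is not None
assert re.fullmatch("AB*", "BA") is None
```

Here fullmatch checks whether the whole string is described by the expression, while search looks for an occurrence anywhere in the text, the problem this chapter addresses.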
We’ll concentrate on concatenation, or, and closure in order to show
the basic principles in developing a regular-expression pattern matching al-
gorithm. Various additions are commonly made in actual systems for con-
venience. For example, -A might mean “match any character except A.”
This not operation is the same as an or involving all the characters except
A but is much easier to use. Similarly, “?” might mean “match any letter.”
Again, this is obviously much more compact than a large or. Other examples
of additional symbols which might make specification of large patterns easier
are symbols which match the beginning or end of a line, any letter or any
number, etc.
These operations can be remarkably descriptive. For example, the pattern
description ?*(ie + ei)?* matches all words which have ie or ei in them (and so

are likely to be misspelled!); and (1 + 01)*(0 + 1) describes all strings of 0’s and
1’s which do not have two consecutive 0’s. Obviously there are many different
pattern descriptions which describe the same strings: we must try to specify
succinct pattern descriptions just as we try to write efficient algorithms.
The pattern matching algorithm that we’ll examine may be viewed as
a generalization of the brute force left-to-right string searching method (the
first method that we looked at in Chapter 19). The algorithm looks for the
leftmost substring in the text string which matches the pattern description by
scanning the text string from left to right, testing at each position whether
there is a substring beginning at that position which matches the pattern
description.
Pattern Matching Machines
Recall that we can view the Knuth-Morris-Pratt algorithm as a finite-state
machine constructed from the search pattern which scans the text. The
method we will use for regular-expression pattern matching is a generalization
of this.
The finite-state machine for the Knuth-Morris-Pratt algorithm changes
from state to state by looking at a character from the text string and then
changing to one state if there’s a match, to another state if not. A mismatch
at any point means that the pattern couldn’t occur in the text starting at that
point. The algorithm itself can be thought of as a simulation of the machine.
The characteristic of the machine that makes it easy to simulate is that it
is deterministic: each state transition is completely determined by the next
input character.
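The table that drives such a machine, telling it where to resume after a mismatch, can be computed in a few lines. The following sketch uses a common 0-indexed formulation, so its entries are shifted relative to the book's 1-indexed tables:

```python
def kmp_next(p):
    """Failure links for pattern p: nxt[j] is the length of the longest
    proper prefix of p[0..j] that is also a suffix of it, i.e. where
    matching resumes after a mismatch just past position j."""
    nxt = [0] * len(p)
    k = 0                      # length of the current matched prefix
    for j in range(1, len(p)):
        while k > 0 and p[j] != p[k]:
            k = nxt[k - 1]     # fall back to the next shorter border
        if p[j] == p[k]:
            k += 1
        nxt[j] = k
    return nxt
```

For ABRACADABRA this gives [0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4], reflecting the reoccurrences of A and of the prefix ABRA inside the pattern.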
To handle regular expressions, it will be necessary to consider a more
powerful abstract machine. Because of the or operation, the machine can’t
determine whether or not the pattern could occur at a given point by examin-
ing just one character; in fact, because of closure, it can’t even determine how
many characters might need to be examined before a mismatch is discovered.
The most natural way to overcome these problems is to endow the machine
with the power of nondeterminism: when faced with more than one way to
try to match the pattern, the machine should “guess” the right one! This
operation seems impossible to allow, but we will see that it is easy to write a
program to simulate the actions of such a machine.
For example, the following diagram shows a nondeterministic finite-state
machine that could be used to search for the pattern description (A*B+AC)D
in a text string.

As in the deterministic machine of the previous chapter, the machine can
travel from a state labeled with a character to the state “pointed to” by that
state by matching (and scanning past) that character in the text string. What
makes the machine nondeterministic is that there are some states (called null
states) which not only are not labeled, but also can “point to” two different
successor states. (Some null states, such as sta.te 4 in the diagram, are “no-
op” states with one exit, which don’t affect the operation of the machine,
but which make easier the implementation of the program which constructs
the machine, as we’ll see. State 9 is a null state with no exits, which stops
the machine.) When in such a state, the machine can go to either successor
state regardless of what’s in the input (without scanning past anything). The
machine has the power to guess which transition will lead to a match for the
given text string (if any will). Note that there are no “non-match” transitions
as in the previous chapter: the machine fails to find a match only if there is
no way even to guess a sequence of transitions that leads to a match.
The machine has a unique initial state (which is pointed to by a “free”
arrow) and a unique final state (which has no arrows going out). When started
out in the initial state, the machine should be able to “recognize” any string
described by the pattern by reading characters and changing state according
to its rules, ending up in the “final state.” Because it has the power of
nondeterminism, the machine can guess the sequence of state changes that
can lead to the solution. (But when we try to simulate the machine on a
standard computer, we’ll have to try all the possibilities.) For example, to
determine if its pattern description (A*B+AC)D can occur in the text string
CDAABCAAABDDACDAAC
the machine would immediately report failure if started on the first or second
character; it would work some to report failure on the next two characters; it
would immediately report failure on the fifth or sixth characters; and it would
guess the sequence of state transitions

5 2 2 1 2 1 2 3 4 8 9
to recognize AAABD if started on the seventh character.
We can construct the machine for a given regular expression by building
partial machines for parts of the expression and defining the ways in which
two partial machines can be composed into a larger machine for each of the
three operations: concatenation, or, and closure.
We start with the trivial machine to recognize a particular character. It’s
convenient to write this as a two-state machine, with one initial state (which
also recognizes the character) and one final state, as below:
Now to build the machine for the concatenation of two expressions from the
machines for the individual expressions, we simply merge the final state of
the first with the initial state of the second:
Similarly, the machine for the or operation is built by adding a new null state
pointing to the two initial states, and making one final state point to the
other, which becomes the final state of the combined machine.
Finally, the machine for the closure operation is built by making the final
state the initial state and making it point back to the old initial state and a
new final state.

A machine can be built which corresponds to any regular expression by
successively applying these rules. The numbers of the states for the example
machine above are in order of creation as the machine is built by scanning
the pattern from left to right, so the construction of the machine from the
rules above can be easily traced. Note that we have a 2-state trivial machine
for each letter in the regular expression, and each + and * causes one state to
be created (concatenation causes one to be deleted) so the number of states is
certainly less than twice the number of characters in the regular expression.
Representing the Machine
Our nondeterministic machines will all be constructed using only the three
composition rules outlined above, and we can take advantage of their simple
structure to manipulate them in a straightforward way. For example, no state
has more than two arrows leaving it. In fact, there are only two types of states:
those labeled by a character from the input alphabet (with one arrow leaving)
and unlabeled (null) states (with two or fewer arrows leaving). This means
that the machine can be represented with only a few pieces of information
per node. For example, the machine above might be represented as follows:
State   Character   Next 1   Next 2
  0                    5        -
  1        A           2        -
  2                    3        1
  3        B           4        -
  4                    8        8
  5                    6        2
  6        A           7        -
  7        C           8        -
  8        D           9        -
  9                    0        0
The rows in this table may be interpreted as instructions to the nondeter-
ministic machine of the form “If you are in State and you see Character then
scan the character and go to state Next 1 (or Next 2).” State 9 is the final
state in this example, and State 0 is a pseudo-initial state whose Next 1 entry
is the number of the initial state.
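The table can be checked directly by simulating the machine, keeping track of the set of states it could be in. The Python sketch below transcribes the table values given above; treating state 9's exits as a halt is this sketch's choice.

```python
# Each state: (character, or None for a null state; Next 1; Next 2).
# Values transcribed from the table for the pattern (A*B+AC)D.
MACHINE = {
    0: (None, 5, None),   # pseudo-initial state
    1: ('A', 2, None),
    2: (None, 3, 1),
    3: ('B', 4, None),
    4: (None, 8, 8),      # "no-op" null state with one effective exit
    5: (None, 6, 2),
    6: ('A', 7, None),
    7: ('C', 8, None),
    8: ('D', 9, None),
}
FINAL = 9                 # the null state with no exits stops the machine

def closure(states):
    """Add every state reachable through null transitions alone,
    without scanning any input."""
    seen, stack = set(), list(states)
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        if s != FINAL and MACHINE[s][0] is None:
            for n in MACHINE[s][1:]:
                if n is not None:
                    stack.append(n)
    return seen

def matches_at(text, i):
    """Could the machine, started at text position i, reach FINAL?"""
    states = closure({0})
    while True:
        if FINAL in states:
            return True
        if i >= len(text) or not states:
            return False
        # labeled states whose character matches text[i] scan it and advance
        states = closure({MACHINE[s][1] for s in states
                          if MACHINE[s][0] == text[i]})
        i += 1
```

Tracking a set of states is how a deterministic program copes with the machine's "guessing": started at the seventh character (index 6) of CDAABCAAABDDACDAAC, the simulation reaches state 9, as in the sequence of transitions above.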