Chapter 19: Repetitive Searches

There are only two or three human stories, and they go on repeating themselves as fiercely as if they had never happened before.

O Pioneers!, Willa Cather

Did you spot the problem with the example program that searched for code snippets in a text file at the beginning of Chapter 18? In lines that have multiple code snippets, everything between the first "<CODE>" and the last "</CODE>" is listed as a single snippet. To separate multiple snippets, we first have to change the regular expression a bit so that it doesn't swallow multiple snippets. In this case, we can replace the ".*" with a nongreedy repetition:

string expr = "<CODE>(.*?)</CODE>";

Now, to resume searching after the text that matched, we have to change the code. To do that, instead of searching the entire line from the file, we use a pair of iterators that point at the contents of the line. After a match, we advance the first iterator to point at the character immediately following the match and search again.
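
A minimal sketch of that loop, assuming a C++11 standard library (where the TR1 regular expression components live in namespace std rather than std::tr1). The function name find_snippets is hypothetical and is not taken from the original program:

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect the text of every <CODE>...</CODE> snippet in one line.
std::vector<std::string> find_snippets(const std::string& line)
{
    static const std::regex rgx("<CODE>(.*?)</CODE>");
    std::vector<std::string> snippets;
    std::string::const_iterator first = line.begin();
    const std::string::const_iterator last = line.end();
    std::smatch match;
    while (std::regex_search(first, last, match, rgx)) {
        snippets.push_back(match[1].str());   // capture group 1: the snippet
        first = match[0].second;              // resume just past this match
    }
    return snippets;
}
```

Because every match contains at least the two tags, the position always advances and the loop terminates.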

Don't be fooled, though: Repetitive searches aren't usually that easy to write. For example, if the regular expression begins with a "^", simply restarting the search after a match, as the previous example does, can lead to wrong answers. The following program searches the target text "abcdef" for subsequences that match the regular expression "^(abc|def)". The only one is the initial "abc", but the program finds two, reporting that "def" also matches.
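
A minimal sketch of such a naive search, assuming C++11 <regex>; the name naive_search is hypothetical, and the loop assumes that no match is empty:

```cpp
#include <regex>
#include <string>
#include <vector>

// Naive repetitive search: restart after each match with no extra flags.
// Assumes the expression never produces an empty match.
std::vector<std::string> naive_search(const std::string& text,
                                      const std::string& expr)
{
    const std::regex rgx(expr);
    std::vector<std::string> found;
    std::string::const_iterator first = text.begin();
    const std::string::const_iterator last = text.end();
    std::smatch match;
    while (std::regex_search(first, last, match, rgx)) {
        found.push_back(match.str());
        first = match[0].second;   // the engine now sees this as the start
    }
    return found;
}
```

Calling naive_search("abcdef", "^(abc|def)") reports both "abc" and "def", which is the wrong answer described above.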

In this chapter, we look first at the complications that any repetitive search has to allow for and the techniques for fixing problems (Section 19.1). Then we look at prewritten solutions, in the form of the class template regex_iterator (Section 19.2) and the class template regex_token_iterator (Section 19.3).

19.1 Brute-Force Searches

In Chapter 17 we looked at several flags that you can pass to the regular expression search functions to change the details of regular expression matching. Here, we look at some of those flags again but in the context of specific problems that arise in repetitive searches. Eventually, we'll build a search function that avoids these problems; you can judge for yourself whether that's a better approach than using the two forms of regular expression iterator that the TR1 library provides.

19.1.1 Lost Anchors

Earlier in this chapter, we looked at a naive search function that reported two matches when applying the regular expression "^(abc|def)" to the target text "abcdef". The problem with simply repeating the same search at a new location in the target text, as that program did, is that on the second call to regex_search, the target text is passed, effectively, as "def", which does match the regular expression. That is, we chopped off the start of the target text but didn't tell the search function that we had done that, so it matched the "^" at the beginning of the regular expression to the beginning of the text that we passed, even though the text was not the beginning of the target text. The solution to this problem is simply to tell the search function that we're not at the beginning of a line, so "^" shouldn't match. To do that, we use the flag match_not_bol for all searches except the first.
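
A sketch of that fix, assuming C++11 <regex>; the name search_keep_anchor is hypothetical, and the loop still assumes that no match is empty:

```cpp
#include <regex>
#include <string>
#include <vector>

// Repetitive search that stops "^" from matching after the first search.
std::vector<std::string> search_keep_anchor(const std::string& text,
                                            const std::string& expr)
{
    namespace rc = std::regex_constants;
    const std::regex rgx(expr);
    std::vector<std::string> found;
    std::string::const_iterator first = text.begin();
    const std::string::const_iterator last = text.end();
    rc::match_flag_type flags = rc::match_default;
    std::smatch match;
    while (std::regex_search(first, last, match, rgx, flags)) {
        found.push_back(match.str());
        first = match[0].second;
        flags |= rc::match_not_bol;   // later searches do not start a line
    }
    return found;
}
```

With this change, search_keep_anchor("abcdef", "^(abc|def)") reports only "abc".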

19.1.2 Lost Word Boundaries

The regular expression "\babc" should match the target text "abcabc" in one place: the first occurrence of the character sequence "abc". The second "abc" doesn't match, because it doesn't start at a word boundary. If you try the previous search function with this regular expression and target text, it will find two matches. The problem is similar to the one with lost anchors: When we restart the search after the first match, the regular expression engine treats the start of the text as a word boundary. You might be tempted to fix that with the same approach we used before, by adding the flag match_not_bow after a successful match. But the two cases are different: A "^" can match only at the beginning of the original target text, so it's okay to simply disallow that match once we've moved away from the beginning of the text. A word boundary can occur inside the target text as well as at the beginning, so we have to be careful to disable matching the beginning of a word only when we're not at the beginning of a word. That can be done by checking whether the last character in a match can be in a word and, if so, prohibiting matching the beginning of a word on the next pass. That solves half the problem.

The other half of the problem occurs with a regular expression like "\b3", when matched against the target text "33". The first "3" is at a word boundary, so it should match. The second "3" is not at a word boundary, so it should not match. But the previous version of search will find that the second one matches because in the target text that's passed for the second search, it is at the beginning of the target text. So we also need to disable matching of the end of a word when the previous character cannot be in a word.

But there's an easier way. The regular expression engine already knows how to identify characters that can be in a word, so we don't need to write that logic ourselves. All we need to do is tell the engine that it can look at the character in front of the target text to decide whether it's at the beginning of a word. That's what the flag match_prev_avail does. Of course, we should do that only when we know that a valid character is in front of the target text. Once we've moved forward in the target text, we know that we can look behind the current position.
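
A sketch of this approach, assuming C++11 <regex>. The name search_with_lookbehind is hypothetical, and the loop assumes that no match is empty, so that a character always precedes the restart position:

```cpp
#include <regex>
#include <string>
#include <vector>

// Repetitive search that lets the engine look at the preceding character.
std::vector<std::string> search_with_lookbehind(const std::string& text,
                                                const std::string& expr)
{
    namespace rc = std::regex_constants;
    const std::regex rgx(expr);
    std::vector<std::string> found;
    std::string::const_iterator first = text.begin();
    const std::string::const_iterator last = text.end();
    rc::match_flag_type flags = rc::match_default;
    std::smatch match;
    while (std::regex_search(first, last, match, rgx, flags)) {
        found.push_back(match.str());
        first = match[0].second;
        // A nonempty match moved us forward, so the character in front of
        // first is valid; match_not_bol carries over the lost-anchor fix.
        flags |= rc::match_not_bol | rc::match_prev_avail;
    }
    return found;
}
```

Both problem cases now give one match: "\babc" against "abcabc", and "\b3" against "33". A target such as "abc abc" still gives two, because the second "abc" really does start a word.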

19.1.3 Empty Matches

To understand the problem that empty matches pose, we first need to look at empty matches in more detail. The regular expression "a*" matches a sequence of zero or more repetitions of the character 'a'. When it matches zero characters, that's an empty match. If you call regex_search to see whether that regular expression matches the target text "bcd", the answer will be that it matches, right at the beginning.

If we use the search function that we wrote to eliminate lost anchors to search for all occurrences of "a*" in the target text "bcd", we'll get into trouble. The first match is at offset 0, and its length is 0, so the function will adjust the position in the target string by zero characters and call regex_search again. This will loop until you get bored and terminate the program.

There are two obvious solutions. First, move the position in the target text forward by one character when you get an empty match. Second, temporarily prohibit empty matches. Both work for some cases, but, as we'll see, you really need a combination of the two.
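
A sketch of the first fix, assuming C++11 <regex>; the name search_skip_empty is hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// First attempted fix: step one character forward after an empty match.
std::vector<std::string> search_skip_empty(const std::string& text,
                                           const std::string& expr)
{
    namespace rc = std::regex_constants;
    const std::regex rgx(expr);
    std::vector<std::string> found;
    std::string::const_iterator first = text.begin();
    const std::string::const_iterator last = text.end();
    rc::match_flag_type flags = rc::match_default;
    std::smatch match;
    while (std::regex_search(first, last, match, rgx, flags)) {
        found.push_back(match.str());
        first = match[0].second;
        if (match.length() == 0) {
            if (first == last)
                break;        // empty match at the very end: stop here
            ++first;          // otherwise jump past the empty match
        }
        // first has moved past at least one character by now.
        flags |= rc::match_not_bol | rc::match_prev_avail;
    }
    return found;
}
```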

Note the test for first == last; without this, the function will increment first past the end of the target text if an empty match occurs at the end of the target text. This works fine for the regular expression "a*", but try it with the regular expression "a*|c". It doesn't see that the regular expression matches the "c" in the target text. That's because it finds the empty match at that position and jumps past it.

This version of search implements the second fix, using the flag match_not_null to prevent empty matches until after the next successful match.
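
A sketch of the second fix, assuming C++11 <regex>; the name search_not_null is hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// Second attempted fix: after an empty match, forbid empty matches
// until the next successful (nonempty) match.
std::vector<std::string> search_not_null(const std::string& text,
                                         const std::string& expr)
{
    namespace rc = std::regex_constants;
    const std::regex rgx(expr);
    std::vector<std::string> found;
    std::string::const_iterator first = text.begin();
    const std::string::const_iterator last = text.end();
    rc::match_flag_type flags = rc::match_default;
    std::smatch match;
    while (std::regex_search(first, last, match, rgx, flags)) {
        found.push_back(match.str());
        first = match[0].second;
        if (first != text.begin())    // only then is a previous char valid
            flags |= rc::match_not_bol | rc::match_prev_avail;
        if (match.length() == 0)
            flags |= rc::match_not_null;    // no empty matches next time
        else
            flags &= ~rc::match_not_null;   // allow them again
    }
    return found;
}
```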

This program does, indeed, find the match of "c", but it's not right, because it misses the empty match before "c". We've shut off empty matches for too long. The fix is to shut off empty matches only at the current position in the target text. To do that, we need two changes. First, we need to add the flag match_continuous, so that the regular expression search engine won't look for matches that occur after the start of the target text. That way, we control when the search advances further into the target text. Second, if that constrained search fails, we need to turn off the constraint and move to the next position in the target text. That is, we need to combine the two previous attempted solutions.
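
A sketch of the combined approach, assuming C++11 <regex>. The names robust_search and Hit are hypothetical; matches are returned rather than printed so that the result is easy to inspect:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::size_t, std::string> Hit;   // (offset, matched text)

std::vector<Hit> robust_search(const std::string& text, const std::string& expr)
{
    namespace rc = std::regex_constants;
    const rc::match_flag_type here_only =
        rc::match_not_null | rc::match_continuous;
    const std::regex rgx(expr);
    std::vector<Hit> found;
    std::string::const_iterator first = text.begin();
    const std::string::const_iterator last = text.end();
    rc::match_flag_type flags = rc::match_default;
    bool constrained = false;   // true right after an empty match
    std::smatch match;
    for (;;) {
        const rc::match_flag_type now =
            constrained ? (flags | here_only) : flags;
        if (std::regex_search(first, last, match, rgx, now)) {
            const std::size_t offset =
                static_cast<std::size_t>(match[0].first - text.begin());
            found.push_back(Hit(offset, match.str()));
            first = match[0].second;
            constrained = (match.length() == 0);
        } else if (constrained && first != last) {
            ++first;              // constrained retry failed: step ahead
            constrained = false;
        } else {
            break;                // nothing more to find
        }
        if (first != text.begin())
            flags |= rc::match_not_bol | rc::match_prev_avail;
    }
    return found;
}
```

For "a*|c" against "bcd", this sketch reports five matches: an empty match at each of the offsets 0, 1, 2, and 3, plus "c" at offset 1.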

Now we have a robust search function. It's a little difficult to reuse,1 though, because the action that it performs when it finds a match is embedded in the code that finds the match. Although this code can be made more generic, in most cases, you should use one of the two forms of iterator that the TR1 library provides, rather than trying to adapt this explicit loop.

19.2 The regex_iterator Class Template
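
A sketch of the iterator-based approach, assuming C++11 <regex>. The name iterate_matches is hypothetical, and this simplified variant collects the matched text in a vector with an explicit loop instead of copying to an ostream_iterator:

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect the text of every match by walking a pair of regex iterators.
std::vector<std::string> iterate_matches(const std::string& text,
                                         const std::string& expr)
{
    const std::regex rgx(expr);
    std::vector<std::string> found;
    std::sregex_iterator first(text.begin(), text.end(), rgx);
    const std::sregex_iterator last;   // default-constructed: end of sequence
    for (; first != last; ++first)     // operator++ runs the next search
        found.push_back(first->str());
    return found;
}
```

Anchors, word boundaries, and empty matches are all handled inside the iterator's operator++.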

The program creates a regular expression object, rgx, that holds the regular expression to search for. Then the program creates a regex_iterator object,2 first, passing two iterators that delineate the target text and passing the regular expression object. The program also creates an end-of-sequence iterator, last. These two iterators describe a sequence of match_results objects, with successive elements in the sequence holding the results of successive repetitive searches. The program then creates an ostream_iterator<cmatch> object, which inserts cmatch objects into its target stream, using the operator<< that the program defined earlier, and passes all three iterators to the standard copy algorithm, which copies the contents of the range defined by [first,last) into the target, out. The tricky code that we had to write in the loop in the previous example is all handled in the regex_iterator's operator++, which is called inside copy.

The class template describes an object that can serve as a forward iterator for an unmodifiable sequence of character sequences that match a regular expression.

The template argument BidIt must be a bidirectional iterator. It names the type of the iterator that will designate the target character sequence when an iterator object is created. The template arguments Elem and RXtraits name the character type and the traits type, respectively, for the regular expression type, basic_regex<Elem, RXtraits>, that will be passed to a regex_iterator object's constructor. By default, these arguments are derived from the first type argument, BidIt.

You create a regex_iterator object by passing two iterators that delineate a character range to be searched and a basic_regex object that holds the regular expression to search for. The resulting object points at the first matching subsequence in the target sequence. Each application of operator++ advances the iterator to point at the next matching subsequence, until there are no more matching subsequences. At that point, the iterator compares equal to the end-of-sequence iterator, which is created with the default constructor.

The template defines several nested types (Section 19.2.1) and provides three constructors and an assignment operator (Section 19.2.2). An object can be dereferenced with operator* and operator-> (Section 19.2.3), and can be incremented, to point at the next element in the output sequence, with operator++ (Section 19.2.4). Two regex_iterator objects of the same type can be compared for equality (Section 19.2.5). Four predefined types for the most commonly used character types are described in Section 19.2.6.

The definition of this template includes several members marked as exposition only. These members are used in the descriptions of some of this template's member functions that follow. Keep in mind that these members aren't required by TR1. The rule is that the member functions have to act as if they were implemented according to the descriptions.

19.2.1 Nested Types

typedef basic_regex<Elem, RXtraits> regex_type;

The type is a synonym for basic_regex<Elem,RXtraits>.

The typedef names the type of the regular expression object that will be used in searches. In most cases the regular expression object traffics in the same element type as the target text, so Elem is simply the value type of the bidirectional iterator type BidIt. For example, if the target text to be searched is going to be designated by a const char*, the regular expression object will ordinarily have type basic_regex<char, regex_traits<char> >. This typedef is especially handy if you prefer qualified ids over using declarations.

The constructor constructs an object with initial values first and last equal to first1 and last1, respectively; pre equal to &re;3 and flags equal to flgs. The constructor then calls regex_search(first, last, match, *pre, flags); if that call returns false, it marks the object as an end-of-sequence iterator.

In other words, the constructor stores the various search parameters, then searches for the first occurrence of text matching re in the range of characters pointed at by [first1,last1). If the search succeeds, the result is stored in the member data object match. If the search fails, there are no matches, and the object is marked as an end-of-sequence iterator, that is, an object that compares equal to a default-constructed object.

19.2.3 Dereferencing

The behavior of a program that calls either of these member operators on an end-of-sequence iterator is undefined. Otherwise, the first member operator returns a reference to the contained object match, and the second member operator returns a pointer to the contained object match.

The contained object match holds the results of the most recent successful search, so you can use these operators to look at those results, just as if you had written a call to regex_search yourself and passed a match_results object.
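
A small sketch of both operators, assuming C++11 <regex>; the name first_match and its group parameter are hypothetical:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Return the requested capture group of the first match, or an empty
// string if nothing matches or the group does not exist.
std::string first_match(const std::string& text, const std::regex& rgx,
                        std::size_t group = 0)
{
    std::sregex_iterator it(text.begin(), text.end(), rgx);
    const std::sregex_iterator end;   // end-of-sequence iterator
    if (it == end)
        return std::string();         // dereferencing here would be undefined
    const std::smatch& results = *it; // operator*: the match_results object
    if (group >= results.size())
        return std::string();
    return it->str(group);            // operator->: call a member directly
}
```

The iterator refers to rgx rather than copying it, so the regular expression object must outlive the iterator.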

The operator next sets flags to flags | match_prev_avail and calls regex_search(start, last, match, *pre, flags). If the call returns false, the operator marks the object as an end-of-sequence iterator. The operator returns *this.

Whenever a call to regex_search returns true, the operator adjusts the contents of match so that match.prefix().first is equal to the previous value of match[0].second; for each value of idx for which match[idx].matched is true, match[idx].position() returns the value of distance(begin, match[idx].first).

You probably recognized most of this text as a description of the repetitive search algorithm we developed in Section 19.1. But the last paragraph adds a twist: Regardless of how it got there, the prefix after a successful search is the text from the end of the previous successful match up to the current match, and all the match positions are offsets from the start of the original text sequence.
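
A sketch that makes the prefix and position adjustments visible, assuming C++11 <regex>; the names MatchInfo and describe_matches are hypothetical:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct MatchInfo {
    std::ptrdiff_t offset;   // offset from the start of the original text
    std::string prefix;      // text between the previous match and this one
    std::string text;        // the matched text itself
};

std::vector<MatchInfo> describe_matches(const std::string& text,
                                        const std::regex& rgx)
{
    std::vector<MatchInfo> out;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(text.begin(), text.end(), rgx);
         it != end; ++it) {
        MatchInfo info;
        info.offset = it->position();
        info.prefix = it->prefix().str();
        info.text = it->str();
        out.push_back(info);
    }
    return out;
}
```

For the target text "ab12cd345" and the expression "\d+", the second match has offset 6 and prefix "cd", even though the underlying search restarted at offset 4.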

Look at how the output showing the various matches is formatted in this example, which is similar to the previous one.

The first member operator returns true only if *this and right are both end-of-sequence iterators or if first == right.first, last == right.last, pre == right.pre, flags == right.flags, and match == right.match. The second member operator returns !(*this == right).

This rather lengthy description says what you'd expect: If you create two regex_iterator objects with the same arguments or by copying one onto the other, they compare equal. If you increment two equal iterators the same number of times, they still compare equal. As long as the searches, either at construction or as part of an increment, succeed, the object does not compare equal to an end-of-sequence iterator. When a search fails, as we saw earlier, the iterator object is marked as an end-of-sequence iterator; at that point, it compares equal to any other end-of-sequence iterator.

19.3 The regex_token_iterator Class Template

Dereferencing a regex_iterator object produces a match_results object that represents the current match. As we saw in several earlier examples, the returned object can, in turn, be used to get at various submatches of a successful match. A regex_token_iterator object provides direct access to submatches. When you construct a regex_token_iterator object, you pass an additional set of numeric arguments that designate the desired submatches. Each time you increment the iterator, it advances to the next submatch. When it runs out of submatches, the iterator moves to the next match and starts the list of submatches over again. So the explicit loop over submatches that we used earlier can be eliminated.

This program is much simpler than the similar one in Section 19.2.4 but doesn't provide as much information. That's because operator* on a regex_token_iterator object returns a sub_match object, which points at a portion of the target text and, unlike match_results, does not know how far into the target text this match occurred.
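
A sketch of a submatch walk with regex_token_iterator, assuming C++11 <regex>; the name list_fields is hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// Walk the selected submatches of every match, in order.
std::vector<std::string> list_fields(const std::string& text,
                                     const std::regex& rgx,
                                     const std::vector<int>& fields)
{
    std::vector<std::string> out;
    std::sregex_token_iterator first(text.begin(), text.end(), rgx, fields);
    const std::sregex_token_iterator last;   // end-of-sequence iterator
    for (; first != last; ++first)
        out.push_back(first->str());         // *first is a sub_match
    return out;
}
```

With the expression "(\w+)=(\d+)" and the field list 1, 2, the target text "x=1, y=22" yields "x", "1", "y", "22".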

The class template describes an object that can serve as a forward iterator for an unmodifiable sequence of character sequences that match various parts of a regular expression.

The template argument BidIt must be a bidirectional iterator. It names the type of the iterator that will designate the target character sequence when an iterator object is created. The template arguments Elem and RXtraits name the character type and the traits type, respectively, for the regular expression type, basic_regex<Elem, RXtraits>, that will be passed to a regex_token_iterator object's constructor. By default, these arguments are derived from the first type argument, BidIt.

You create a regex_token_iterator object by passing two iterators that delineate a character range to be searched and a basic_regex object that holds the regular expression to search for, just as you do for a regex_iterator object. In addition, though, you pass one or more integer values that identify the various submatches that you want to iterate through. The constructors search for the first text subsequence that matches the regular expression. The resulting object points at the first of the designated submatches in the matching subsequence. Each application of operator++ moves to the next submatch. If the list of submatches has been exhausted, the operator searches for the next text subsequence that matches the regular expression and points at the first of the designated submatches in the matching subsequence. If there are no more matching subsequences, the iterator compares equal to the end-of-sequence iterator, which is created with the default constructor.

The template defines several nested types (Section 19.3.1) and provides five constructors and an assignment operator (Section 19.3.2). An object can be dereferenced with operator* and operator-> (Section 19.3.3) and can be incremented to point at the next element in the output sequence with operator++ (Section 19.3.4). Two regex_token_iterator objects of the same type can be compared for equality (Section 19.3.5). Four predefined types for the most commonly used character types are described in Section 19.3.6.

The definition of this template includes several members marked as exposition only. These members are used in the descriptions that follow of some of the member functions of this template. Keep in mind that these members aren't required by TR1. The rule is that the member functions have to act as if they were implemented according to the descriptions.

The descriptions also use a couple of technical terms that are defined in TR1. A suffix iterator is an iterator object of type regex_token_iterator that points at the final sequence of characters in the target text. The current match is (*pos).prefix() if subs[N] is -1; otherwise, (*pos)[subs[N]].

That last term is the key to understanding how a regex_token_iterator determines the sequence of submatches to return. When you construct a regex_token_iterator object, you pass one or more integer values, as described in Section 19.3.2. Those values, in turn, determine which submatches will be returned and in what order. A value of -1 refers to the text beginning at the end of the previous match (or at the beginning of the text sequence when the iterator object is first constructed) and ending at the beginning of the current match. After the final, failed, search, a value of -1 refers to the text from the end of the last successful search (or the beginning of the text sequence if no search succeeded) to the end of the text sequence. Any other value refers to the corresponding capture group. Thus, a value of 0 means the entire matched text, a value of 1 means the first capture group, and so on. Each time you increment an iterator object, it advances to the next subgroup, as determined by those integer values. When it's gone through all those values, it moves to the next match and repeats the sequence of values.
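
A sketch of the -1 selector used to split text on a separator, assuming C++11 <regex>; the name split is hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the pieces of text that lie between matches of separator.
std::vector<std::string> split(const std::string& text,
                               const std::regex& separator)
{
    // -1 selects the text that precedes each match, plus any nonempty
    // text that follows the last match.
    std::sregex_token_iterator first(text.begin(), text.end(), separator, -1);
    const std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

Splitting "one two   three" on the expression "\s+" yields "one", "two", "three". If no search succeeds, the whole text comes back as a single piece.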

19.3.1 Nested Types

typedef basic_regex<Elem, RXtraits> regex_type;

The type is a synonym for basic_regex<Elem,RXtraits>.

The typedef names the type of the regular expression object that will be used in searches. For details, see the discussion in Section 19.2.1.

The first constructor stores the value of submatch in subs. The second and third constructors each copy their argument submatch into subs.

The constructors then set the value of N to 0 and the value of pos to iter(first, last, re, flags). If pos is not an end-of-sequence iterator, the constructors set res to the address of the current match. Otherwise, if any of the values stored in subs is -1, the constructors set *this to be a suffix iterator that points at the entire text sequence [first,last). Otherwise, the constructors set *this to an end-of-sequence iterator.

The first constructor takes exactly one integer argument, which designates the sub-group to be returned by the iterator. To see the entire matching text, pass the value 0. To see the nth capture group, pass n. To see the text that precedes the match, pass -1.

19.3.3 Dereferencing

The behavior of a program that calls either of these member operators on an end-of-sequence iterator is undefined. Otherwise, the first member operator returns a reference to the current match, and the second member operator returns a pointer to the current match.

The first member function makes a copy of *this, increments *this, and returns the copy.

If the stored iterator pos is an end-of-sequence iterator, the second operator marks *this as an end-of-sequence iterator. Otherwise, the operator increments the stored value N; if the result is equal to subs.size(), it sets the stored value N to 0 and increments the stored iterator pos. If incrementing the stored iterator leaves it unequal to an end-of-sequence iterator, the operator does nothing further. Otherwise, if the end of the preceding match was at the end of the character sequence, the operator marks *this as an end-of-sequence iterator. Otherwise, the operator repeatedly increments the stored value N until N == subs.size(), in which case it marks *this as an end-of-sequence iterator, or until subs[N] == -1, thus ensuring that the next dereference will return the suffix of the last successful match. In all cases, the operator returns *this.

To better understand how a submatch selector of -1 works, think of the target text as a sequence of subsequences U_1 M_1 U_2 M_2 ... U_m M_m U_{m+1}, where the various subsequences M_i match the regular expression, and the various subsequences U_i do not match the regular expression. A selector of -1 selects the U_i subsequences, including the final nonmatching subsequence U_{m+1} if it is not empty.
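
A sketch that lists the nonmatching and matching subsequences in order by combining the selectors -1 and 0, assuming C++11 <regex> and its initializer-list constructor; the name interleave is hypothetical:

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the nonmatching and matching pieces of text in order:
// for each match, the text before it (-1) and then the match itself (0).
std::vector<std::string> interleave(const std::string& text,
                                    const std::regex& rgx)
{
    std::sregex_token_iterator first(text.begin(), text.end(), rgx, {-1, 0});
    const std::sregex_token_iterator last;
    return std::vector<std::string>(first, last);
}
```

For the target text "a1b22c" and the expression "\d+", the result is "a", "1", "b", "22", "c". For "a1", the final nonmatching subsequence is empty, so the result is just "a", "1".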

The first member function returns true if *this and right are both end-of-sequence iterators or if both are suffix iterators that point at the same text sequence. Otherwise, if either of them is an end-of-sequence iterator or a suffix iterator, the member function returns false. Otherwise, the member function returns pos == right.pos && subs == right.subs && N == right.N.

The second member function returns !(*this == right).

Two regex_token_iterator objects compare equal if they were constructed from the same regular expression argument and equal other arguments, and they have been incremented the same number of times. When you make a copy of a regex_token_iterator object, the first requirement is obviously satisfied, so a copy of a regex_token_iterator object is equal to the object it was copied from if both have been incremented the same number of times since the copy was made.

Exercises

Exercise 1 For each of the following errors, write a simple test case containing the error, and try to compile it. In the error messages, look for the key words that relate to the error in the code.

Attempting to construct a regex_iterator object by passing a pair of iterators whose character type is different from the regex_iterator type's character type

Attempting to construct a regex_iterator object by passing a regular expression object whose element type or traits type is different from the regex_iterator type's element type or traits type

Attempting to construct a regex_token_iterator object by passing a pair of iterators whose character type is different from the regex_token_iterator type's character type

Attempting to construct a regex_token_iterator object by passing a regular expression object whose element type or traits type is different from the regex_token_iterator type's element type or traits type

Attempting to construct a regex_token_iterator object by passing a field specifier as a pointer to int instead of an array of int

Attempting to decrement a regex_iterator object

Attempting to decrement a regex_token_iterator object

Exercise 2 In the first part of this chapter, I mentioned that it's a little hard to reuse the brute-force loop. In this exercise, we look at a couple of possible approaches to reuse and at doing the same thing with regular expression iterators.

Write a program that has a copy of the code of the search function in Example 19.8. Change the search function so that for a successful match, it shows the contents of the first capture group instead of the entire match. Now use the function to copy to cout all text that occurs between the tags "<CODE>" and "</CODE>"4 in an HTML file of your choosing.5

Now write another program that has a copy of the code of the search function in Example 19.8. Change the search function into a template function with a template type parameter named Fn and an additional function call argument, Fn func. Also replace the code that shows the match by inserting it into cout with a call to func(match). Now use the function for the same search as in the preceding part of this exercise.6

Write a program that uses a regex_iterator object to do the same search.

Write a program that uses a regex_token_iterator object to do the same search.

Now change all four programs to copy to cout all text that occurs between the tags "<CODE>" and "</CODE>" or between the tags "<PRE>"7 and "</PRE>".

Exercise 3 Use a pair of regex_iterator objects to search for valid hostnames8 in an HTML file, and use the utility function you wrote for Exercise 2 in Chapter 18 to show the contents of each successful match.

Exercise 4 Write a program that uses a pair of regex_token_iterator objects to extract data fields from a comma-separated file. Don't forget to allow for spaces and tabs before and after each comma.9

Exercise 5 Write a program that puts the integer values 1 and 4 into a vector<int> and passes that vector as the field specifier in the constructor of a regex_token_iterator object. Use that object to search for your favorite regular expression. Now put the same values into an array of int, pass that array to the constructor, and repeat the search. What happens if the field index is higher than the index of the last capture group in the regular expression? What happens if you repeat a field index in the initializer?

Exercise 6 HTML cross-references have the form "<A HREF="reference">text</A>" and "<A NAME="reference">text</A>". The first is a link, and the second is the target of a link. In both cases, the reference is in quotes. Write a program that uses a pair of regex_token_iterator objects to search for cross-references in an HTML file and shows, for each cross-reference, either "HREF=" or "NAME=", as appropriate, followed by the text of the reference.

Footnotes

1 That is, unless "reuse" means "cut and paste," as is often the case, for example, in Java.

2 The type cregex_iterator is a regex_iterator that looks at sequences delineated by char*s.

3 Note that the iterator holds the address of the regular expression object, not a copy. Once the regular expression object is destroyed, the iterator can no longer be used.

4 That is, search for text matching the regular expression "<CODE>(.*?)</CODE>"; for each successful match, write out the contents of capture group 1.

5 Hint: Read the entire text file into a string object by creating an ifstream object to read the file and a basic_ostringstream object to build the string, and inserting the buffer returned by the ifstream's member function rdbuf() into the basic_ostringstream object.

6 You'll have to write a callable type whose function call operator takes a match_results object and copies the first capture group to cout.

7 That is, search for text matching the regular expression "<(CODE|PRE)>(.*?)</\1>"; for each successful match, write out the contents of capture group 2.

8 See Exercise 2 in Chapter 17 for a suitable regular expression.

9 Hint: Write a regular expression that describes the separator, and use an iterator that shows the text that doesn't match the separator.
