Introduction

There are probably dozens of reasons a programmer might want to search a text string for the presence of any of several substrings simultaneously. The brute force approach taken all too often is to simply loop through a list of tokens, then search the string sequentially, beginning to end, for each one. Needless to say, this is very inefficient, and can even become outrageously long if the list of tokens is long and/or the searched string is long. I have created a class to simplify the task and perform it very efficiently. The class is offered in two versions now: one using MFC CString and CStringArray, and another using STL std::string and std::vector.

Acknowledgement

In the June 2002 issue of C/C++ Users Journal, Moishe Halibard and Moshe Rubin published an article entitled "A Multiple Substring Search Algorithm", in which they described in detail their algorithm for an efficient finite state machine (FSM) that searches for multiple substrings in one pass. My class is based on this algorithm.

Update Details

For those of you who have seen the versions of these classes that I published in 2002, here is a summary of the major changes I've made:

Support for a string of delimiter characters.

Acceleration of the FindNext method to skip over delimiters at the start of a search, and skip to the next delimiter when a word is rejected.

Replacement of the FSM implementation to use a std::vector of a new SEntry struct type instead of a dynamic array of DWORDs.

Class Interface

The CIVStringSet class is derived from either CStringArray or std::vector, so as to easily inherit the capability of a dynamic array of CStrings or std::strings.

However, many of the base class methods have been hidden by my class to prevent any insertion or removal of strings except at the end of the array. It is critical that the array index for a given word not be changed after it is added to the FSM.

The constructor takes a single, optional argument that determines the initial width of the FSM. The width needs to be at least the combined length of all the strings in the array. The FSM is enlarged as needed when strings are added to the collection, so a good guess of the total length at construction merely saves a reallocation later.

There are seven overloads of the Add method, allowing you to add one word at a time, or an entire set, list, or array of words at once. See the demo project for an example of using one form of Add, as well as FindFirstIn and FindNext.

As strings are added to the collection, the FSM is updated. Once all strings are added, no additional preparation is required to do a search. There are three methods for searching: FindFirstIn, FindNext, and FindAllIn. The first two are used to iterate through the found substrings one at a time. The third method uses the first two to build a CWordPosPairList for you. Simply pass in the text string to be searched as the first parameter to FindFirstIn or FindAllIn, and the class will find all instances of any strings in the array.

FindFirstIn and FindNext both return the offset within the searched text at which a token was found. On return, the size_t & parameter holds the index in the string array of the token that was found. FindAllIn returns the number of tokens found, and takes a CWordPosPairList & as its second parameter. A CWordPosPairList is an STL list of pairs. Each pair returned contains the index and offset of a found token. See the demo program for an example of using this.
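To make the shape of the FindAllIn result concrete, here is a minimal sketch. The alias WordPosPairList and the function FindAllNaive are hypothetical stand-ins, not the class's real declarations. The pair order (index, then offset) follows the description above, and the search itself is a naive per-token scan rather than the FSM.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for CWordPosPairList: (token index, text offset).
using WordPosPairList = std::list<std::pair<std::size_t, std::size_t>>;

// Naive stand-in for FindAllIn: records every occurrence of every token
// and returns the number of hits. This is NOT the single-pass FSM search.
std::size_t FindAllNaive(const std::string& text,
                         const std::vector<std::string>& tokens,
                         WordPosPairList& found)
{
    found.clear();
    for (std::size_t i = 0; i < tokens.size(); ++i)
    {
        if (tokens[i].empty())
            continue;   // an empty token would match everywhere
        for (std::size_t pos = text.find(tokens[i]);
             pos != std::string::npos;
             pos = text.find(tokens[i], pos + 1))
        {
            found.emplace_back(i, pos);
        }
    }
    return found.size();
}
```

A caller would then walk the list and read pair.first as the token index and pair.second as the offset. Note that this naive version groups results by token, whereas a single-pass search would naturally report them in text order.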

Demo Program

By request, I have also provided a simple dialog-based demo program that shows the usage of the Add and Find methods. It uses the form of Add that takes a delimited string, parses it to create the list of seek tokens, then fills a list box with those found.

Source Highlights

For those who are curious, but don't want to bother downloading the source, here are some of the most interesting functions. If the code and comments aren't satisfactory documentation, let me know.

Comments and Discussions

"The quick red fox jumped over the lazy brown dog. Now is the time for all good men to come
to the aid of their country. Ask not what your country can do for you, but what you can do
for your country."

And if the search substring is "fox jumped over the lazy", then we need a method which will search for this substring and return its index as "14".

"The quick red fox jumped over the lazy brown dog. Now is the time for all good men to come
to the aid of their country. Ask not what your country can do for you, but what you can do
for your country."

If the embedded word to search for is "uic", then I need a method which will search for the substring "uic" in the main buffer and return its index as "5".

Actually, I need a method which will search for the substring as an Embedded word or a Non-Embedded word, depending on its input parameters.

Your method provides a way to search for Non-Embedded words but not Embedded words.
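A minimal sketch of what such a switchable search might look like for a single substring follows. FindWord and its wholeWordOnly parameter are hypothetical names, not part of CIVStringSet, and the sketch assumes that "Non-Embedded" means the match is not flanked by alphanumeric characters.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the offset of the first occurrence of 'word' in 'text', or npos.
// When wholeWordOnly is true, a hit counts only if the characters on both
// sides of it (if any) are not alphanumeric.
std::size_t FindWord(const std::string& text, const std::string& word,
                     bool wholeWordOnly)
{
    if (word.empty())
        return std::string::npos;

    auto isWordChar = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };

    for (std::size_t pos = text.find(word);
         pos != std::string::npos;
         pos = text.find(word, pos + 1))
    {
        if (!wholeWordOnly)
            return pos;                       // embedded matches are fine

        const std::size_t end = pos + word.size();
        const bool leftOk  = (pos == 0) || !isWordChar(text[pos - 1]);
        const bool rightOk = (end == text.size()) || !isWordChar(text[end]);
        if (leftOk && rightOk)
            return pos;                       // stands alone as a word or phrase
    }
    return std::string::npos;
}
```

With the sample buffer above, searching for "fox jumped over the lazy" with wholeWordOnly set to true yields 14, and searching for "uic" with it set to false yields 5. With it set to true, "uic" is rejected because it sits inside "quick".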

First, great, great stuff (glad you came back and revisited it) Yes, we do care! (laughing)...

A couple of comments:

1. ++ operators cause rollover, so you need a bigger container to test for exceeding the max value of a system type. m_dw64MaxUsedState should be an unsigned __int64. Then in CIVStringSet::InsertWord(), replace "if ( m_dw64MaxUsedState < MAXDWORD )" with "if ( m_ui64MaxUsedState < (unsigned __int64)MAXDWORD )". Replace "dwState = ++m_dwMaxUsedState; // use a new state number" with "dwState = (DWORD)(++m_ui64MaxUsedState); // use a new state number".

2. It's much faster if instead of instantiating new instances of things inside loops, you re-use a variable instantiated outside the loop - yes, the compiler optimizer will catch some of these, but you'd be surprised how many it misses (especially object pointers). Consider these changes to CIVStringSet::InsertWord():

    if ( bNew )
        dwState = (DWORD)(++m_ui64MaxUsedState); // use a new state number

    // special case - last character of substring
    if ( nChar == ( nLen-1 ) )
    {
        // Put both the state number and the word index in the entry
        (m_apnFSM[dwCurState][nIdxChar]).m_dwState = dwState ;
        (m_apnFSM[dwCurState][nIdxChar]).m_dwIndex = dwIndex ;
        break ;
    }
    else if ( bNew ) // if current entry for this char value and state is still zero, add to FSM
        (m_apnFSM[dwCurState][nIdxChar]).m_dwState = dwState ;

You'll see more improvement in CIVStringSet::FindNext() if you make the same kinds of changes there than in CIVStringSet::InsertWord() but InsertWord() was smaller to use as an example (smile).

3. Right idea on the character set limits, but the implementation could be improved. Rather than assuming 'A' to 'z' as the allowed characters, it's much faster (quicker search stops for unqualified matches) if you use a dynamically-determined char set drawn straight from the search words themselves. Loop over the input search words pulling out all unique characters actually used, sort them, and use members of that list as the character value test for matches. Not only will the search be faster, but you can now support non-Latin searches and in-word punctuation such as hyphens and possessive single quotes (with some other minor changes I won't describe here). I'm too lazy to provide the code to do this within these comments, but will provide it if you really want it.
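The commenter did not post code for point 3, so the following is only a guess at the idea, not their implementation. BuildAlphabet and ColumnOf are hypothetical helper names. The sketch collects the distinct characters used by the search words, sorts them, and maps each character to a column number by binary search.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Collects every distinct character used by the search words, in sorted order.
std::string BuildAlphabet(const std::vector<std::string>& words)
{
    std::string alphabet;
    for (const std::string& w : words)
        alphabet += w;
    std::sort(alphabet.begin(), alphabet.end());
    alphabet.erase(std::unique(alphabet.begin(), alphabet.end()),
                   alphabet.end());
    return alphabet;
}

// Maps a character to its column in the sorted alphabet, or returns npos
// when no search word uses that character (so no match can continue).
std::size_t ColumnOf(const std::string& alphabet, char c)
{
    const auto it = std::lower_bound(alphabet.begin(), alphabet.end(), c);
    if (it == alphabet.end() || *it != c)
        return std::string::npos;
    return static_cast<std::size_t>(it - alphabet.begin());
}
```

For the words "dog's" and "well-fed" this gives a ten-character alphabet, so an FSM row would need ten columns instead of one per possible character value, and any text character that maps to npos can end a partial match immediately. The binary search adds a cost per character, so whether the search is faster overall would need measuring.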

I agree. Thank you for pointing it out. In fact, I found that problem the day after I posted this (April 26th), and made some more design changes to the class. I didn't upload my revisions here because I just wasn't sure how important it would be to anyone. (Since I was updating a really old article, I didn't know if anybody really cared anymore.)

One of the changes I made was to add some new methods to the CIVStringSet class header:

Can this code (or a derivative) simply detect if all search tokens are in a string, and quit as soon as this has been determined? My impression is that the existing code continues to the end of the string and finds all matches of all tokens.

Here is a scenario: a 40 meg text file is in a memory buffer and split into about 300,000 lines. I want to make a list of those lines which contain all of the search words. The memory buffer will be searched repeatedly for different groups of tokens.

Suppose line 1000 is:
"There are probably dozens of reasons a programmer might want to search a text string for the presence of any of several substrings simultaneously. The brute force approach taken all too often is to simply loop through a list of tokens, then search the string sequentially, beginning to end, for each one. Needless to say, this is very inefficient, and can even become outrageously long if the list of tokens is long and/or the searched string is long. I have created a class to simplify the task and perform it very efficiently."

If the search tokens were "string" and "search", then this line can be considered a "match" after about 85 characters, and move on to the next line.

I wonder if there is a way for the code to stop looking for the token "search" after it has been found (about position 70), and proceed to only look for "string". Can a token be removed part way thru a search if this would speed up the search (maybe it wouldn't speed up the search more than a minor amount, if at all)?

Another possible optimization would be to switch to something like Boyer-Moore-Horspool for the rest of the string when there was only one token left (but that is getting pretty complicated).

Can a call such as DetectWhetherAllTokensExist be done with the existing code? If so, how? If not, what modifications would be involved?

You're welcome. Since I wrote these classes almost four years ago, I had to reacquaint myself with the code.
As Phil B pointed out back in 2003, my code doesn't force full-word matches. He offered a semi-solution that insists that a match not be counted unless the following character is a space. Of course, a space is not the only word delimiter, and to check for all those would make the code slower. I'd use something like strspn or SpanIncluding if I needed to do that, though.
Removing strings from the FSM would increase speed, because it would reduce the number of paths followed when searching. I didn't provide for removal because I never considered a project like yours before. Detecting that there is only one token left would add too much overhead to the algorithm, and would negate any performance increase you might have gotten by switching to another algorithm.
To implement DetectWhetherAllTokensExist, you'd obviously need to somehow know that a particular word has already been found. If you implement a Remove method in the class, certainly removing the word from the search would accomplish that. Once all words have been removed, you'd know you're done.
So, how do you remove a string from the set? Remember that it's vital to not change the index of an entry in the set, so removing can't involve changing the position of elements in the array. I haven't pondered this enough to give you a solution now, and I just got a call from my son, who needs a ride. Maybe I'll come up with an answer later. Let me know if you get there first.

I haven't stared at your algorithm closely enough to really understand what it is doing. Is your approach somewhat similar to the Aho-Corasick algorithm?

Just a quick, uninformed guess .... would it be possible to replace the matched member of the FSM with some convention like { 0xFF 0xFF .... 0xFF } which would never match again. Then you could keep a count of matches, and when all were found, you could quit processing that line and go to the next.

Check the Acknowledgement in the article, and you'll see that the algorithm was created by Halibard and Rubin. It was published in the June 2002 issue of C/C++ Users Journal, which you can get by registering at http://www.cuj.com. No explanation of the algorithm I give here could be as thorough as reading it straight from the horses' mouths.

Once you read the article (or parse the code more closely), you'll see that the substrings do not get stored as single entities in the FSM. Indeed, by definition, a Finite State Machine is not like a table so much as it is like a map. Strings that have the same initial letters will share some entries in the FSM as they are mapped into it. This is partly why it is more efficient. Rather than comparing strings one at a time with your text, it compares all possible matches at once.
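To illustrate how words with common prefixes share entries, here is a deliberately small table-driven sketch. It is not the Halibard-Rubin implementation or the CIVStringSet code. TinyFsm, Entry and kNoWord are hypothetical names, the 128-column rows and the bitwise AND with 0x7F are assumptions, and on a mismatch the sketch simply restarts at the next character.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kNoWord = static_cast<std::size_t>(-1);

// One cell of the table: the state to move to, and the word ending here.
struct Entry
{
    std::size_t nextState = 0;        // 0 = no transition from this cell
    std::size_t wordIndex = kNoWord;  // index of the word completed here, if any
};

class TinyFsm
{
public:
    TinyFsm() : m_rows(1) {}          // row 0 is the start state

    // Threads 'word' through the table, reusing cells that already exist.
    void Add(const std::string& word, std::size_t index)
    {
        std::size_t state = 0;
        for (std::size_t i = 0; i < word.size(); ++i)
        {
            const std::size_t col = static_cast<unsigned char>(word[i]) & 0x7F;
            if (m_rows[state][col].nextState == 0)
            {
                m_rows[state][col].nextState = m_rows.size();  // new state number
                m_rows.emplace_back();
            }
            if (i + 1 == word.size())
                m_rows[state][col].wordIndex = index;          // word ends here
            state = m_rows[state][col].nextState;
        }
    }

    std::size_t StateCount() const { return m_rows.size(); }

    // Returns the offset of the first word found at or after 'from', or npos.
    // The earliest start wins; on a tie, the shortest word is reported.
    std::size_t Find(const std::string& text, std::size_t from,
                     std::size_t& index) const
    {
        for (std::size_t start = from; start < text.size(); ++start)
        {
            std::size_t state = 0;
            for (std::size_t i = start; i < text.size(); ++i)
            {
                const std::size_t col = static_cast<unsigned char>(text[i]) & 0x7F;
                const Entry& e = m_rows[state][col];
                if (e.wordIndex != kNoWord)
                {
                    index = e.wordIndex;
                    return start;
                }
                if (e.nextState == 0)
                    break;            // dead end: retry from the next character
                state = e.nextState;
            }
        }
        return std::string::npos;
    }

private:
    std::vector<std::array<Entry, 128>> m_rows;
};
```

Adding "string", "strong" and "search" creates 15 rows, where three unrelated words of six letters would need 19, because "str" is stored once and the leading "s" is shared by all three. That sharing is also why a single cell cannot simply be deleted when one word is removed.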

However, because of this design, it somewhat confounds the idea of removing entries from the FSM. Since entries don't necessarily belong to just one string, it's more difficult to determine whether or not it can be removed. I haven't yet had the brainstorm to reveal that solution.

I registered to look at CUG. I didn't realize you could do that without charge ... Thanks.

According to the original article, the algorithm was oriented to detecting prohibited words in e-mail, so it quit once one of the banned words was found.
"The implementation in this article returns the offset within the string of the first matching substring and stops processing."

which is quite a bit different from what I am trying to do.

The article indicates that 0 and -1 have special meaning. I haven't stared at your code enough to be close to grok'ing it, but perhaps a 0 or -1 could be "moved up" the data structure related to a word that was matched, so that detection would be shortcut and proceed to the next letter with the state machine "cleared".

If there was no prefix with any other search word (such as looking for "hello" and "goodbye"), then it would be up near the "top". If there was a common prefix (as in my original question using the example "string" and "search"), then the state associated with 's' + 't' would be "shortcut/truncated" when "string" was found.

I was wondering if it would be simpler and still advantageous if the code could keep track of which words had been matched, and quit when they were all "hit". It wouldn't necessarily stop detecting previously matched substrings.
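The bookkeeping described in the previous paragraph could look something like the sketch below. ContainsAllTokens is a hypothetical free function, not a method of CIVStringSet, and the inner loop is a plain comparison standing in for the FSM step. Tokens are assumed to be non-empty.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns true as soon as every token has been seen at least once in 'line'.
// The scan stops early once nothing remains to be found.
bool ContainsAllTokens(const std::string& line,
                       const std::vector<std::string>& tokens)
{
    std::vector<bool> hit(tokens.size(), false);
    std::size_t remaining = tokens.size();

    for (std::size_t pos = 0; pos < line.size() && remaining > 0; ++pos)
    {
        for (std::size_t i = 0; i < tokens.size(); ++i)
        {
            if (hit[i])
                continue;             // already found: no need to test again
            if (line.compare(pos, tokens[i].size(), tokens[i]) == 0)
            {
                hit[i] = true;
                --remaining;
            }
        }
    }
    return remaining == 0;
}
```

The extra cost per hit is one flag test and one counter decrement. Whether that stays cheap inside the real FSM loop is exactly the open question in this thread.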

Thanks for taking the time to refresh your memory with this code, and considering a revision. It does seem like a worthwhile enhancement with broad applicability.

BTW, have you ever seen a C or C++ implementation of the Aho-Corasick algorithm for detecting multiple sub-strings in a string? I'm only aware of that algorithm, and my understanding is that it also uses a FSM approach.

As they say, "if this stuff was easy, it wouldn't pay so well." (written by a freeware developer to a software contributor)

I was not familiar with that algorithm before, although I know the name Aho to be the 'a' in the famous "awk" Unix utility (Weinberger and Kernighan, also of great Unix fame, are the 'w' and 'k'). Upon cursory examination, their algorithm appears very similar to the Halibard-Rubin one upon which my classes are based. Interestingly, they reference Kernighan's spam filter as an inspiration to their work.

There is a CP article (http://www.codeproject.com/csharp/ahocorasick.asp) that may interest you if you haven't already seen it.

At this time, I'm not interested in enhancing my classes (for free, at least). Especially since there are so many other similar pieces of code out there. It would be a simple matter to keep track of which words have been hit, but a performance drain to detect when all have been hit -- at least given what my imagination has conjured so far.

l_d_allan, did you ever locate a solution? Because I too am looking for the same thing: a class/procedure that returns true if each token exists in the search string. Ideally, I don't care how many times a token is present, so it seems a short-circuited logic works best: when a token is found once, we stop searching for that one. But, I guess that's a bit different from how FSM works.

I have not spent a lot of time viewing the code, but it appears that the efficient part of this procedure is the 'looking' for the substring. Creating new FSM from words/substrings of interest appears not to be expensive.

So here is my suggestion. Use FindFirstIn to locate the first substring. Now create a new 'problem' with the rest of the original string from the point where substring was found. Create a new FSM with one less substring and start a new search. Repeat the process till all the words/substrings have been found. You do not have to make any changes to the class or the search procedure to do this.

Please help me add a CStringList to a CIVStringSet object. I tried to use the overloaded member function Add(CStringList list), but an error is shown. It says "cannot convert from CStringList to const char *".
Moreover, the Add() function takes too much time to add all the CStrings to the list when I pass a CString as a parameter.

Why can't I use the overloaded member function Add(CStringList list)? Please reply, as I am facing a deadline.

Are you using a CStringList object, or perhaps a pointer to a CStringList? The function requires a direct reference to the object, not a pointer. You are using the MFC version of the class, and not the STL version, right?

As concerns "taking too much time", can you be more specific? Have you identified which step is taking the most time?

You can dramatically increase the number of strings this class can handle by changing the array m_apnFSM to be some STL dynamically-allocating type, and using that template class's dynamic allocation mechanisms in place of the SetColDim method. This should also be faster if you're adding a large number of strings, as the original increases by a small static amount each time it re-allocs, and the STL libraries try to amortize the cost of memory allocation.
With that change it seems to scale to around 80,000 strings on my machine.
(I used a deque >, as the [] operator is a little faster for vector, but deques are faster if you keep forcing the collection to allocate more memory.)

Oleg,
I have not needed to adapt this to Unicode. It could be adapted to Unicode strings by substituting wchar_t for char in the code. However, there is an assumption that only 128 characters are in the alphabet, as we only allocate 128 arrays and mask out indexes above 0x7F. With some small effort, I think you could do it.
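A small sketch of why the 128-character assumption matters once wide characters are involved. ColumnFor is a hypothetical helper, and it assumes the masking is a bitwise AND with 0x7F, which the description above does not spell out.

```cpp
#include <cstddef>

// Picks a table column from a wide character using only its low 7 bits.
// Assumption: this mirrors the 0x7F mask mentioned above.
inline std::size_t ColumnFor(wchar_t ch)
{
    return static_cast<std::size_t>(ch) & 0x7F;
}
```

Under that assumption, L'\u00E9' (e with an acute accent, 0xE9) lands in the same column as L'i' (0x69), so two different characters would be treated as one. A Unicode port would therefore need more columns or a different mapping from characters to columns, in addition to the type substitution.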

I am designing a Postscript editing program, which will search through the Postscript text file looking for a particular string. The problem I am having is the amount of time it is taking to search each line for the string. Is there any way of speeding this up, and not looking at each line, maybe somehow "jump" to the correct place in the file?

I thought of porting the class to Java, as I find it useful. During the process I found that it uses DWORD as a container for (non-defined) structs.
This makes the algorithm less than obvious, not to mention cumbersome to port.
C++ is blessed with all that is required to make the algorithm both readable and efficient, so why not use it?
Apart from that - great class.