Introduction

The problem of searching for a string
within another string is important in Computer Science. Most programmers are
familiar with the C strstr function which provides the caller with an ability
to search for a string needle within a string haystack. This article is about a
similar function which is generated using a Linux shell script and provides a
similar facility.

Background

The approach taken for implementing this
version of string searching is to generate a search function with a variable
degree of search branches. The code searches linearly within the string
haystack using the multiple search logic each with different offsets into the
string within the main search loop.

Where in the above code, STRSTR is a macro that tests
a possible match by comparing the entire string haystack S with the string
needle T to be found. If found, the results are placed into pointer P,
otherwise P is NULL. The variable m represents the number of search logics that
are generated. In this example, m is equal to four, and there are four if
statements of the form:

if(*str && ((str + k[0]) - orig) < len && *(str + k[0]) == *f)

These if statements check the input string haystack
represented by pointer 'str' to see if the first character is a match to the
first character of the string needle represented by pointer 'f'. However, care
must be taken to assert that the test is not out of bounds from the length of
string 'str'. This is done in the first part of the test:

*str && ((str + k[0]) - orig) < len

The array k contains m offsets into string 'str'. In
this way, during a linear scan of the string 'str', multiple offsets into the
string can be scanned in the same pass. This approach is similar to how a loop
would be unraveled for optimization. Once there is a possible match,
represented by the following condition within the if statement:

*(str + k[0]) == *f

The macro STRSTR is invoked and a test to see if a
match is found is made.

Clearly, in the worst case, the algorithm will scan
the length of the entire string haystack which is N, and perform a constant m
such comparisons where the worst case for the comparisons is given by scanning
the length of the entire string needle which is M. Thus, the algorithm is O(N *
M). This will happen in cases where the string haystack is of the form 'aaaa'
and the string needle is of the form 'ab'. A pass through 'aaaa' will have each
of the m search logics match the first character of 'ab' causing STRSTR to
execute and scan all of the string needle and failing each time. However, if
the string needle does exist within the string haystack, then the algorithm
should execute much faster than that as it would likely find a possible match
within a pass of the string haystack which is enhanced by the multiple search
logics scanning at different positions within the string haystack. In the
average case, the code will likely take O(N/2) scan of the haystack (similar to
linear search) with all proceeding constant m tests failing and one succeeding
taking O(M) time for an average running time of O(N/2 + M). The best case is
that the algorithm matches the first characters of the string haystack with the
string needle, then immediately goes to STRSTR to scan for the needle in the
haystack and returns success which would be O(M) time.

Using the Code

A Bash script called gen_string_match.sh
is provided to emit the C++ code. It allows the user to generate as large or
as small a number of internal

test branches to run within the main search loop. To
run the script and

The shell script does not generate a header file, so
to use the '_strstr' function, declare in a header:

extern"C"char *_strstr(constchar *s, constchar *t);

Points of Interest

The code is generated to allow the
user to use a function that has many search logics within the linear pass of
the string haystack. If this is equal to the length of the string haystack,
then the code will match the needle within the first pass.

Comments and Discussions

Your algorithm basically cuts the searched string in m pieces and then searches them individually. Searching them one char at a time side by side does not make the algorithm more efficient in any manner than a trivial search. Having an algorithm in O(m * (n/m)) or O(n) is the same...
So what is the point of making an article for that?

BTW, your STRSTR function does not do what strstr does, and should probably not called that way to avoid confusion. This is more like a strncmp.
Your calls to STRSTR could skip first char as it was already matched in each if.

If you had proposed a solution with Open MP or some parallel stuff it could have been useful.
Of course some Boyer Moore or more advanced algo will go way faster than this algo.

And _please_ don't call all your variables with single letters, the hard disk space is not that expensive now, you are authorized to use meaningful names.