SIMD-friendly algorithms for substring searching

Author:

Wojciech Muła

Date:

2016-11-28

Updated on:

2018-02-14 (spelling), 2017-04-29 (ARMv8 results)

Introduction

Popular programming languages provide methods or functions which locate a
substring in a given string. In C it is the function strstr, the C++
class std::string has the method find, Python's string has methods
find and index, and so on, and so forth. All these APIs were designed for
one-shot searches. During the past decades several algorithms solving this
problem were designed; an excellent page by Christian Charras and
Thierry Lecroq lists most of them (if not all). Basically these
algorithms can be split into two major categories: (1) based on a
Deterministic Finite Automaton, like Knuth-Morris-Pratt, Boyer-Moore, etc.,
and (2) based on simple comparisons, like the Karp-Rabin algorithm.

The main problem with these standard algorithms is the silent assumption
that comparing a pair of characters, looking up in an extra table and
branching are cheap, while comparing two substrings is expensive.

But current desktop CPUs do not meet this assumption, in particular:

There is no difference between comparing one, two, four or eight bytes on a
64-bit CPU. When a processor supports SIMD instructions, then comparing
vectors (that is 16, 32 or even 64 bytes) is as cheap as comparing a single
byte.

Thus comparing short sequences of chars can be faster than fancy algorithms
which avoid such comparisons.

Looking up in a table costs one memory fetch, so at least an L1 cache round
trip (~3 cycles). Reading char-by-char costs just as many cycles.

Mispredicted jumps cost several cycles of penalty (~10-20 cycles).

There is a short chain of dependencies: read a char, compare it,
conditionally jump. This makes it hard to utilize the out-of-order
execution capabilities present in a CPU.

The Karp-Rabin algorithm does an exact substring comparison whenever weak
hashes are equal. One hash is calculated just once for the searched
substring, and another one is calculated for a portion of the string; in
every iteration the second hash is updated at small cost. The following
code shows the idea:
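A minimal scalar sketch of this idea (using a simple additive hash for
illustration instead of a proper rolling hash; the function name is made
up) may look like this:

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Karp-Rabin-like search: do the exact (and costly) substring comparison
// only when cheap, weak hashes of the needle and the current window match.
size_t karp_rabin_like(const char* s, size_t n, const char* needle, size_t k) {
    if (k == 0 || k > n)
        return std::string::npos;

    unsigned hash_needle = 0; // calculated just once
    unsigned hash        = 0; // updated in each iteration at small cost
    for (size_t i = 0; i < k; i++) {
        hash_needle += (unsigned char)needle[i];
        hash        += (unsigned char)s[i];
    }

    for (size_t i = 0; /**/; i++) {
        // exact comparison only when the weak hashes are equal
        if (hash == hash_needle && memcmp(s + i, needle, k) == 0)
            return i;

        if (i + k >= n)
            break;

        // slide the window: add the incoming char, drop the outgoing one
        hash += (unsigned char)s[i + k];
        hash -= (unsigned char)s[i];
    }
    return std::string::npos;
}
```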

SIMD solutions replace the hash predicate with a vector predicate, which is
calculated in parallel and, hopefully, fast. For each "true" element of the
predicate vector an exact comparison of substrings is performed.

This is one source of improvement; another is a careful implementation.
A generic implementation calls a function like memcmp to compare
substrings. But since we know the length of the searched substring, we may
provide specialisations for certain lengths, where a subprocedure call
is replaced by a few CPU instructions, or even just one. Thanks to that,
the cost of calling the procedure and all internal memcmp costs are
simply avoided.

Algorithm 1: Generic SIMD

Algorithm

This algorithm is suitable for all SIMD instruction sets and also for the
SWAR approach. It uses equality of the first and the last characters of the
substring as a predicate.

These two characters are populated in two registers, F and L
respectively. Then in each iteration two chunks of the string are loaded.
The first chunk (A) is read from offset i (where i is the current
offset) and the second chunk (B) is read from offset i + k - 1, where
k is the substring's length.

Then we compute the vector expression (A == F) & (B == L). This step yields
a byte vector (or a bit mask), where "true" values denote positions of
potential substring occurrences. Finally, only at these positions are exact
comparisons of substrings performed.

After merging the comparison results, i.e. AF & BL, we get the following mask:

mask = [ 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 ]

Since the mask is non-zero, there are possible substring occurrences. As we
see, there is only one non-zero element, at index 2, thus only one
substring comparison must be performed.
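In scalar terms the whole scheme can be modelled as follows (an
illustrative sketch, not the vectorised code; the function name is made
up):

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Scalar model of the predicate: a position i is a candidate when both
// s[i] == needle[0] and s[i + k - 1] == needle[k - 1]; only candidates
// are verified with an exact comparison of the middle part.
size_t first_last_scalar(const char* s, size_t n, const char* needle, size_t k) {
    if (k == 0 || k > n)
        return std::string::npos;

    if (k == 1) { // degenerate case: there is no distinct "last" character
        const void* p = memchr(s, needle[0], n);
        return p ? (size_t)((const char*)p - s) : std::string::npos;
    }

    for (size_t i = 0; i + k <= n; i++) {
        if (s[i] == needle[0] && s[i + k - 1] == needle[k - 1]) { // the predicate
            if (memcmp(s + i + 1, needle + 1, k - 2) == 0)       // exact comparison
                return i;
        }
    }
    return std::string::npos;
}
```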

First and last?

Choosing the first and the last character of a substring is not always a
wise decision. Consider the following scenario: a string consists mostly of
'A' characters, and a user wants to find "AjohndoeA" — in such a situation
the number of char-wise comparisons would be large.

In order to prevent such situations an implementation can pick as the
"last" character the farthest character not equal to the first one. If
there is no such character, then all characters in the substring are the
same (for example "AAAAA"). A specialised procedure may be used to handle
such patterns.
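A helper along these lines (the name is illustrative) can pick that
character:

```cpp
#include <cstddef>

// Return the offset of the farthest character not equal to needle[0].
// A return value of 0 means all characters are the same, signalling
// that a specialised procedure should be used instead.
size_t pick_last_offset(const char* needle, size_t k) {
    for (size_t i = k - 1; i > 0; i--) {
        if (needle[i] != needle[0])
            return i;
    }
    return 0;
}
```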

Implementation

SSE & AVX2

Both the SSE and AVX2 versions are practically the same, and both use the
minimum number of instructions. Below is a generic AVX2 version.

It's worth noting that since we already know that the first and the last
characters match, we don't need to compare them again with memcmp.

SWAR

In the SWAR approach, comparison for equality uses the bitwise xor
operation, which yields zero when two bytes are equal. Therefore, instead
of ANDing the partial results, bitwise or is used. Clearly this part of the
algorithm has the same complexity as the SSE/AVX2 code.

However, SWAR requires more effort to locate zero bytes. The following
procedure calculates an exact byte mask, where the MSB of a byte is set
when the corresponding byte of x is zero.

const uint64_t t0 = (~x & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu; // 7th bit set if lower 7 bits are zero
const uint64_t t1 = (~x & 0x8080808080808080llu);                         // 7th bit set if 7th bit is zero

uint64_t zeros = t0 & t1;

Below is the C++ implementation for 64-bit vectors. The while loop contains
an additional condition which might look suboptimal. But searching for the
first set bit and later clearing it (as the SSE version does) is slower.
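Assuming little-endian byte order, a padded buffer and a needle of at least
two characters, it may be sketched as follows (with the extra `zeros & 0x80`
condition in the inner loop):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// SWAR (64-bit) version: xor yields a zero byte where characters are
// equal; the two partial results are merged with bitwise or, then zero
// bytes are located with the exact-mask trick shown earlier.
size_t swar64_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {
    const uint64_t first = 0x0101010101010101llu * (uint8_t)needle[0];
    const uint64_t last  = 0x0101010101010101llu * (uint8_t)needle[k - 1];

    for (size_t i = 0; i + k <= n; i += 8) {
        uint64_t block_first;
        uint64_t block_last;
        memcpy(&block_first, s + i, 8);         // may read past the end:
        memcpy(&block_last,  s + i + k - 1, 8); // the buffer must be padded

        const uint64_t eq = (block_first ^ first) | (block_last ^ last);

        const uint64_t t0 = (~eq & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu;
        const uint64_t t1 = (~eq & 0x8080808080808080llu);
        uint64_t zeros = t0 & t1;

        for (size_t j = 0; zeros != 0; j++, zeros >>= 8) {
            if (zeros & 0x80) { // the additional condition: is byte j a candidate?
                if (i + j + k <= n
                    && memcmp(s + i + j + 1, needle + 1, k - 2) == 0) {
                    return i + j;
                }
            }
        }
    }
    return std::string::npos;
}
```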

AVX512F

AVX512F lacks operations on bytes; the smallest vector item is a 32-bit
word. This limitation forces us to use SWAR techniques.

Using AVX512F instructions we compare two vectors like in the SWAR version,
i.e. two xors joined with a bitwise or. There is only one difference: a
single ternary logic instruction expresses one xor and the bitwise or.

Using AVX512F instructions we locate which 32-bit elements
contain any zero byte.

Then, for each such 32-bit element, we check four substrings for equality.

Unlike the SWAR procedure, where we need a precise mask of zero bytes, the
AVX512F procedure requires just the information "this word has a zero
byte". A simpler algorithm, described in Bit Twiddling Hacks, is used;
below is its C++ implementation.
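For a 64-bit word the expression from Bit Twiddling Hacks is:

```cpp
#include <cstdint>

// Non-zero result iff the word v contains at least one zero byte.
// Unlike the exact SWAR mask, the positions of the set bits are not
// guaranteed to be exact; only the yes/no answer is needed here.
uint64_t haszero(uint64_t v) {
    return (v - 0x0101010101010101llu) & ~v & 0x8080808080808080llu;
}
```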

ARM Neon (32 bit code)

The algorithm can also be easily realised using ARM Neon instructions,
which provide 128-bit SIMD registers. The only problem is caused by the
long round trip from the Neon unit back to the CPU.

It was solved by saving the comparison result back into a 64-bit word in
memory: the lower nibbles come from the lower half of a SIMD register, and
likewise the higher nibbles come from the higher half of the register.

Comparison is done in two loops, separately for lower and higher nibbles.
This split is required to detect substring occurrences in the correct order.

It appeared that unrolling the two inner loops brought about a 1.2x speedup.

AArch64 (64 bit code)

The AArch64 code is almost an exact copy of the above ARM Neon procedure.
The only exception is direct reading of SIMD register lanes, as the
architecture made this operation fast.

Algorithm 2: SSE-specific (MPSADBW)

Algorithm

SSE4.1 and AVX2 provide the instruction MPSADBW, which calculates eight
Manhattan distances (L1) between a given 4-byte sub-vector of one register
and eight subsequent 4-byte sub-vectors of a second register. The
instruction returns a vector of eight words (16-bit values).

When two sub-vectors are equal, the L1 distance is 0, and we may use this
property to locate possible substring locations. In other words, equality
of the four leading characters is used as a predicate.

Although it seems to be a stronger predicate than matching the first and
the last characters, quadratic complexity is unavoidable. For example, when
the searched string contains only the letter "a", and we're looking for
"aaaabcde", the predicate obviously will be true at all input positions.

As if that weren't enough, there are the following problems:

This method handles only substrings not shorter than four characters.
Handling three-char substrings is viable, but requires additional code.

The SSE variant of MPSADBW processes only 8 bytes at once, while the
generic SIMD variant uses the whole register.

The AVX2 variant of MPSADBW works on lanes, i.e. 128-bit halves of
a register, rather than the whole 256-bit register. This imposes
additional code to properly load the data.

Latency of the instruction is pretty high (5 or 7 cycles, depending on the
CPU architecture). Luckily the throughput is 1 or 2 cycles, thus unrolling
a loop can hide the latency.

Building each vector requires two shifts and one bitwise or. In each
iteration four vector comparisons are performed and then four bitmasks are
examined. This makes the loop which compares substrings quite complicated.

Moreover, to properly fill the last elements of the vectors we need four
bytes beyond the vector. This is accomplished by keeping two adjacent
vectors per iteration (only one load per iteration is needed, though).
Finally, the instruction VALIGND is used to extract the required data.

size_t avx512f_strstr_long(const char* string, size_t n, const char* needle, size_t k) {

    __m512i curr;
    __m512i next;
    __m512i v0, v1, v2, v3;

    char* haystack = const_cast<char*>(string);
    char* last     = haystack + n;

    const uint32_t prf = *(uint32_t*)needle; // the first 4 bytes of needle
    const __m512i prefix = _mm512_set1_epi32(prf);

    next = _mm512_loadu_si512(haystack);

    for (/**/; haystack < last; haystack += 64) {
        curr = next;
        next = _mm512_loadu_si512(haystack + 64);
        const __m512i shft = _mm512_alignr_epi32(next, curr, 1);

        v0 = curr;
        {
            const __m512i t1 = _mm512_srli_epi32(curr, 8);
            const __m512i t2 = _mm512_slli_epi32(shft, 24);
            v1 = _mm512_or_si512(t1, t2);
        }
        {
            const __m512i t1 = _mm512_srli_epi32(curr, 16);
            const __m512i t2 = _mm512_slli_epi32(shft, 16);
            v2 = _mm512_or_si512(t1, t2);
        }
        {
            const __m512i t1 = _mm512_srli_epi32(curr, 24);
            const __m512i t2 = _mm512_slli_epi32(shft, 8);
            v3 = _mm512_or_si512(t1, t2);
        }

        uint16_t m0 = _mm512_cmpeq_epi32_mask(v0, prefix);
        uint16_t m1 = _mm512_cmpeq_epi32_mask(v1, prefix);
        uint16_t m2 = _mm512_cmpeq_epi32_mask(v2, prefix);
        uint16_t m3 = _mm512_cmpeq_epi32_mask(v3, prefix);

        int index = 64;
        while (m0 | m1 | m2 | m3) {
            if (m0) {
                int pos = __builtin_ctz(m0)*4 + 0;
                m0 = m0 & (m0 - 1);
                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }
            if (m1) {
                int pos = __builtin_ctz(m1)*4 + 1;
                m1 = m1 & (m1 - 1);
                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }
            if (m2) {
                int pos = __builtin_ctz(m2)*4 + 2;
                m2 = m2 & (m2 - 1);
                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }
            if (m3) {
                int pos = __builtin_ctz(m3)*4 + 3;
                m3 = m3 & (m3 - 1);
                if (pos < index && memcmp(haystack + pos + 4, needle + 4, k - 4) == 0) {
                    index = pos;
                }
            }
        }

        if (index < 64) {
            return (haystack - string) + index;
        }
    }

    return size_t(-1);
}

Algorithm 3: SSE4.2-specific (PCMPESTRM)

Algorithm

SSE4.2 introduced the String and Text New Instructions (STNI), a set of
very complex instructions that were meant to be a building block for string
operations. Unfortunately, Intel practically discontinued STNI in newer
processors: it hasn't introduced AVX2 versions of STNI and made the
instructions extremely slow (a latency of 11 cycles is unacceptable).

Basically the PCMPxSTRx instruction exists in four variants, which differ
only in:

The way the string length is determined: the length might be given
explicitly, or the first zero byte marks the string's end, as in
traditional C strings.

How the result is saved: it might be either a bit-mask/byte-mask or
the index of the first/last set bit in the bit-mask.

An additional instruction argument (an immediate constant) defines several
aspects of execution, specifically the comparison algorithm. There are four
different algorithms available; the one we're using is called equal
ordered. Despite the name, this algorithm locates a substring, or its
prefix if the substring goes beyond the register width.

For example, when we're searching for "ABCD" in "ABCD_ABC_ABCD_AB", the
instruction returns the bitmask 0b0100001000000001, treating the suffix
"AB" as a match. Thus we can safely assume only that the first character
matches, as the tail might or might not be present in the register. (Of
course it can be determined, but this requires additional calculations,
which is not very handy.)

Implementation

SSE

size_t sse42_strstr_anysize(const char* s, size_t n, const char* needle, size_t k) {

    const __m128i N = _mm_loadu_si128((__m128i*)needle);

    for (size_t i = 0; i < n; i += 16) {
        const int mode = _SIDD_UBYTE_OPS
                       | _SIDD_CMP_EQUAL_ORDERED
                       | _SIDD_BIT_MASK;

        const __m128i D   = _mm_loadu_si128((__m128i*)(s + i));
        const __m128i res = _mm_cmpestrm(N, k, D, n - i, mode);

        uint64_t mask = _mm_cvtsi128_si64(res);

        while (mask != 0) {
            const auto bitpos = bits::get_first_bit_set(mask);

            // we know that at least the first character of needle matches
            if (memcmp(s + i + bitpos + 1, needle + 1, k - 1) == 0) {
                return i + bitpos;
            }

            mask = bits::clear_leftmost_set(mask);
        }
    }

    return std::string::npos;
}

Performance results

The performance of various SIMD implementations was measured. The test
programs also have specialisations for short substrings, which are selected
at run time. The performance of the C strstr is included for comparison.
I omitted C++ string::find due to a performance bug in GNU libstdc++ which
makes the method 10 times slower than strstr.

The test programs were run three times. The following computers, running
either Debian or Ubuntu, were tested:

ARM Neon performance is pretty good even for the SWAR implementation.
The SWAR version is 1.7 times faster than string::find, and the SIMD
version is 3.1 times faster.

On AArch64 the performance of the scalar SWAR64 procedure is almost as good
as that of the 32-bit SIMD procedure.

Comparison with strstr might be considered unfair, as that procedure deals
with strings of unknown length, while my implementations get the lengths
and take advantage of this. I fully agree.

The procedures I implemented are also unsafe, because they might read data
past the end of the input string. This may lead to an access violation if
strings are located just before unmapped memory. And for sure address
sanitizers will complain. Making the procedures safe is feasible, but it
wasn't my goal.

Acknowledgments

Daniel Lemire gave me access to Haswell, Skylake, KNL, Bulldozer
and ARMv8 machines, where I compiled and ran the test programs. Thank you!