TL;DR

Wrote some C and assembly code that uses SIMD instructions to perform prefix
matching of strings. The C code was between 4-7x faster than the baseline
implementation for prefix matching. The assembly code was 9-12x faster than the
baseline specifically for the negative match case (determining that an incoming
string definitely does not prefix match any of our known
strings). The fastest negative match could be done in around 6 CPU cycles, which
is pretty quick. (Integer division, for example, takes about 90 cycles.)

Overview

Goal: given a string, determine if it prefix-matches a set of known strings as
fast as possible. That is, in a set of known strings, do any of them prefix
match the incoming search string?

A reference implementation was written in C as a baseline, which simply looped
through an array of strings, comparing each one, byte-by-byte, looking for a
prefix match. Prefix match performance ranged from 28 CPU cycles to 130, and
negative match performance was around 74 cycles.

A SIMD-friendly C structure called STRING_TABLE was
derived. It is optimized for up to 16 strings, ideally of length less than or
equal to 16 characters. The table is created from the set of known strings
up-front; it is sorted by length, ascending, and a unique character (with
regard to other characters at the same byte offset) is then extracted, along
with its index. A 16 byte character array, STRING_SLOT, is used to capture the
unique characters.
A 16 element array of unsigned characters, SLOT_INDEX, is used to capture the
index. Similarly, lengths are stored in the same fashion via SLOT_LENGTHS.
Finally, a 16 element array of STRING_SLOTs is used to capture up to the first
16 bytes of each string in the set.

An example of the memory layout of the STRING_TABLE structure at run time, using
sample test data, is depicted below. Note
the width of each row is 16 bytes (128 bits), which is the size of an XMM register.

The layout of the STRING_TABLE structure allows us to determine that a given
search string does not prefix match any of the 16 strings, testing all of them
at once, in 12 assembly instructions. This breaks down into 18 µops, with a
block throughput of 3.48 cycles on Intel's Skylake architecture. (In practice,
this clocks in at around 6 CPU cycles.)

Here's a simplified walk-through of a negative match in action,
using the search string "CAT":

Ten iterations of a function named IsPrefixOfStringInTable were authored. The
tenth and final iteration was the
fastest, prefix matching in as little as 19 cycles — a 4x improvement over
the baseline. Negative matching took 11 cycles — a 6.7x improvement.

An assembly
version of the algorithm was authored specifically to optimize for the negative
match case, and was able to do so in as little as 8 cycles, representing a 9x
improvement over the baseline. (It was a little bit slower than the fastest
C routine in the case of prefix matches, though, as can be seen below.)

Feedback for an early draft of this article was then solicited via Twitter,
resulting in four more iterations of the C version, and three more iterations of
the assembly version. The PGO build of the fastest C version prefix matched in
about 16 cycles. It also had the best "worst case input string" performance
(where three slots needed comparison), negative matching in about 26 cycles.
The fifth iteration of the assembly version negative matched in about 6 cycles.
These were 3 and 1 cycle improvements, respectively.

We were then ready to publish, but felt compelled to investigate an odd
performance quirk we'd noticed with one of the assembly routines, which
yielded 7 more assembly versions. Were any of them faster? Let's find out.

The Background

The Tracer Project

One of the frustrations I had with existing Python profilers was that there was
no easy or efficient means to filter or exclude trace information based on the module
name of the code being executed. I tackled this in my
tracer project, which allows you to
set an environment variable named TRACER_MODULE_NAMES to restrict which modules
should be traced, e.g.:

set TRACER_MODULE_NAMES=myproject1;myproject2;myproject3.subproject;numpy;pandas;scipy

If the code being executed is coming from the module
myproject3.subproject.foo, then we need to trace it, as that string
prefix matches the third entry on our list.

This article details the custom data structure and algorithm I came up with in
order to try and solve the prefix matching problem more optimally with a SIMD
approach. The resulting StringTable
component is used extensively within the tracer project, and as such, must
conform to unique constraints such as no use of the C runtime library and
allocating all memory through TraceStore-backed allocators. Thus, it's not
really something you'd drop into your current project in its current form.
Hopefully, the article still proves to be interesting.

Note: the code samples provided herein are copied directly from the tracer
project, which is written in C and assembly, and uses the Pascal-esque
Cutler Normal Form style for C. If you're used to the more UNIX-style
Kernel Normal Form of C, it's quite like that, except that it's
absolutely nothing like that, and all these code samples will probably be
very jarring.

Baseline C Implementation

The simplest way of solving this in C is to have an array of C strings (i.e.
NUL-terminated byte arrays), then for each string, loop through byte by byte
and see if it prefix matches the search string.
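As a rough illustration of that loop, here is a minimal sketch in portable C. It is not the tracer project's actual baseline routine: the function name, signature and plain `int` return value are made up for this example, and it uses standard C types rather than the project's own.

```c
#include <stddef.h>

/*
 * Hypothetical baseline: returns the index of the first entry in `table`
 * that is a prefix of (or equal to) `search`, or -1 if there is none.
 */
int is_prefix_of_cstr_in_array(const char *const *table,
                               size_t count,
                               const char *search)
{
    for (size_t i = 0; i < count; i++) {
        const char *t = table[i];
        const char *s = search;

        /* Walk both strings while the table entry still has bytes left
         * and they agree with the search string. */
        while (*t != '\0' && *t == *s) {
            t++;
            s++;
        }

        /* We consumed the whole table entry, so it is a prefix. */
        if (*t == '\0') {
            return (int)i;
        }
    }
    return -1;
}
```

Note that the cost grows with both the number of entries and how far each comparison gets before failing, and that the first matching entry wins, so the order of the table matters.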

Another type of code pattern that the string table attempts to replace is
anything that does a lot of if/else if/else if-type string comparisons to
look for keywords. For example, in the
Quake III source, there's some symbol/string processing logic that looks
like this:

An example of using the string table approach for this problem is discussed
in the Other Applications section.

The Proposed Interface

Let's take a look at the interface we're proposing, the
IsPrefixOfStringInTable function, that this article is based upon:

//
// Our string table index is simply a char, with -1 indicating no match found.
//
typedef CHAR STRING_TABLE_INDEX;
#define NO_MATCH_FOUND -1
typedef
STRING_TABLE_INDEX
(IS_PREFIX_OF_STRING_IN_TABLE)(
_In_ PSTRING_TABLE StringTable,
_In_ PSTRING String,
_Out_opt_ PSTRING_MATCH StringMatch
);
typedef IS_PREFIX_OF_STRING_IN_TABLE *PIS_PREFIX_OF_STRING_IN_TABLE;
IS_PREFIX_OF_STRING_IN_TABLE IsPrefixOfStringInTable;
_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether the search string "starts with or is equal
to" any string in the table.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/

All implementations discussed in this article adhere to that function signature.
The STRING_TABLE structure will be discussed shortly.

The STRING_MATCH structure is used to optionally communicate information about
the prefix match back to the caller. The index and characters matched fields
are often very useful when using the string table for text parsing; see the other applications section below for an example.

The structure is defined as follows:

//
// This structure is used to communicate matches back to the caller.
//
typedef struct _STRING_MATCH {
//
// Index of the match.
//
BYTE Index;
//
// Number of characters matched.
//
BYTE NumberOfMatchedCharacters;
//
// Pad out to 8-bytes.
//
USHORT Padding[3];
//
// Pointer to the string that was matched. The underlying buffer will
// stay valid for as long as the STRING_TABLE struct persists.
//
PSTRING String;
} STRING_MATCH, *PSTRING_MATCH, **PPSTRING_MATCH;
C_ASSERT(sizeof(STRING_MATCH) == 16);

The Test Data

Instead of using some arbitrary Python module names, this article is going to
focus on a string table constructed out of a set of 16 strings that represent
reserved names of the NTFS file system, at least when it was first released
way back in the early 90s.

This list is desirable as it has a good distribution of characters and a good
mix of both short and long entries, plus one oversized one
($INDEX_ALLOCATION, which clocks in at 17 characters). Almost all
strings lead with a common character (the dollar sign), preventing a simple
first character optimization used by
the initial version of the StringTable component I wrote in 2016.

So the scenario we'll be emulating, in this case, is that we've just been passed
a filename for creation, and we need to check if it prefix matches any of the
reserved names.

Here's the full list of NTFS names we'll be using. We're assuming 8-bit ASCII
encoding (no UTF-8) and case-sensitive matching. (If this were actually the NT
kernel, we'd need to use wide characters with UTF-16 encoding, and be
case-insensitive.)

NTFS Reserved Names

$AttrDef

$BadClus

$Bitmap

$Boot

$Extend

$LogFile

$MftMirr

$Mft

$Secure

$UpCase

$Volume

$Cairo

$INDEX_ALLOCATION

$DATA

????

.

The ordering is important in certain cases. For example, when you have
overlapping strings, such as $MftMirr and $Mft, you should put the longest
strings first. Our routine terminates upon the first successful prefix match,
so if a longer string resided after a shorter one that it overlaps with, it
would never get detected.

Let's review some guiding design requirements and cover some of the design
decisions I made, which should help shape your understanding of the
implementation.

Requirements and Design Decisions

The STRING struct will be used to capture incoming search strings as well as the
representation of any strings registered in the table (or more accurately, in
the corresponding StringArray structure associated with the string table).

The design should optimize for string lengths less than or equal to 16. Lengths
greater than 16 are permitted, up to 128 bytes, but they incur more overhead during
the prefix lookup.

The design should prioritize the fast-path code where there is no match for a
given search string. Being able to terminate the search as early as possible is
ideal.

The performance hits taken by unaligned data access are non-negligible, especially
when dealing with XMM/YMM loads. Pay special attention to alignment constraints and
make sure that everything under our control is aligned on a suitable boundary.
(The only thing we can't really control in the real world is the alignment of
the incoming search string buffer, which will often be at undesirable alignments
like 2, 4, 6, etc. Our test program explicitly aligns the incoming search
strings on 32-byte boundaries to avoid the penalties associated with unaligned
access.)

The string table is geared toward a single-shot build. Once you've created it
with a given string array or used a delimited environment variable, that's it.
There are no AddString() or RemoveString() routines. The order you provided the
strings in will be the same order the table uses — no re-ordering will be
done. Thus, for prefix matching purposes, if two strings share a common prefix,
the longer one should go first, as the prefix search routine will check it first.

Only single matches are performed: the routine returns the first match that
qualifies as a prefix match (target string in table had length less than or
equal to the search string, and all of its characters matched). There is no
support for obtaining multiple matches; if you've constructed your string
tables properly (no duplicate or incorrectly-ordered overlapping strings), you
shouldn't need to.

So, to summarise, the design guidelines are as follows.

Prioritize fast-path exit in the non-matched case. (I refer to this as
negative matching in a lot of places.)

Optimize for up to 16 string slots, where each slot has up to 16
characters, ideally. It can have up to 128 in total, however, any bytes
outside of the first sixteen live in the string array structure
supporting the string table (accessible via pStringArray).

If a slot is longer than 16 characters, optimize for the assumption that
it won't be *that* much longer. i.e. assume a string of length 18 bytes
is more common than 120 bytes.

The Data Structures

The primary data structure employed by this solution is the STRING_TABLE
structure. It is composed of supporting structures: STRING_SLOT, SLOT_INDEX and
SLOT_LENGTHS, and either embeds or points to the originating STRING_ARRAY
structure from which it was created.

Let's review the STRING_TABLE structure first and then touch on the supporting
structures.

STRING_TABLE

C - Cutler Normal Form

//
// The STRING_TABLE struct is an optimized structure for testing whether a
// prefix entry for a string is in a table, with the expectation that the
// strings being compared will be relatively short (ideally <= 16 characters),
// and the table of string prefixes to compare to will be relatively small
// (ideally <= 16 strings).
//
// The overall goal is to be able to prefix match a string with the lowest
// possible (amortized) latency. Fixed-size, memory-aligned character arrays,
// and SIMD instructions are used to try and achieve this.
//
typedef struct _STRING_TABLE {
//
// A slot where each individual element contains a uniquely-identifying
// letter, with respect to the other strings in the table, of each string
// in an occupied slot.
//
STRING_SLOT UniqueChars;
//
// (16 bytes consumed.)
//
//
// For each unique character identified above, the following structure
// captures the 0-based index of that character in the underlying string.
// This is used as an input to vpshufb to rearrange the search string's
// characters such that it can be vpcmpeqb'd against the unique characters
// above.
//
SLOT_INDEX UniqueIndex;
//
// (32 bytes consumed.)
//
//
// Length of the underlying string in each slot.
//
SLOT_LENGTHS Lengths;
//
// (48 bytes consumed, aligned at 16 bytes.)
//
//
// Pointer to the STRING_ARRAY associated with this table, which we own
// (we create it and copy the caller's contents at creation time and
// deallocate it when we get destroyed).
//
// N.B. We use pStringArray here instead of StringArray because the
// latter is a field name at the end of the struct.
//
//
PSTRING_ARRAY pStringArray;
//
// (56 bytes consumed, aligned at 8 bytes.)
//
//
// String table flags.
//
STRING_TABLE_FLAGS Flags;
//
// (60 bytes consumed, aligned at 4 bytes.)
//
//
// A 16-bit bitmap indicating which slots are occupied.
//
USHORT OccupiedBitmap;
//
// A 16-bit bitmap indicating which slots have strings longer than 16 chars.
//
USHORT ContinuationBitmap;
//
// (64 bytes consumed, aligned at 64 bytes.)
//
//
// The 16-element array of STRING_SLOT structs. We want this to be aligned
// on a 64-byte boundary, and it consumes 256-bytes of memory.
//
STRING_SLOT Slots[16];
//
// (320 bytes consumed, aligned at 64 bytes.)
//
//
// We want the structure size to be a power of 2 such that an even number
// can fit into a 4KB page (and reducing the likelihood of crossing page
// boundaries, which complicates SIMD boundary handling), so we have an
// extra 192-bytes to play with here. The CopyStringArray() routine is
// special-cased to allocate the backing STRING_ARRAY structure plus the
// accommodating buffers in this space if it can fit.
//
// (You can test whether or not this occurred by checking the invariant
// `StringTable->pStringArray == &StringTable->StringArray`, if this
// is true, the array was allocated within this remaining padding space.)
//
union {
STRING_ARRAY StringArray;
CHAR Padding[192];
};
} STRING_TABLE, *PSTRING_TABLE, **PPSTRING_TABLE;
//
// Assert critical size and alignment invariants at compile time.
//
C_ASSERT(FIELD_OFFSET(STRING_TABLE, UniqueIndex) == 16);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Lengths) == 32);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, pStringArray) == 48);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Slots) == 64);
C_ASSERT(FIELD_OFFSET(STRING_TABLE, Padding) == 320);
C_ASSERT(sizeof(STRING_TABLE) == 512);

The following diagram depicts an in-memory representation of the STRING_TABLE
structure using our NTFS reserved prefix names. It is created via the
CreateStringTable routine, which we feature
in the appendix of this article.

In order to improve the uniqueness of the unique characters selected from each
string, the strings are sorted by length during string table creation and
enumerated in this order whilst identifying unique characters. The rationale
behind this is that shorter strings simply have fewer characters to choose from,
whereas longer strings have more. If we identified unique characters in
the order they appear in the string table, we may have longer strings preceding
shorter ones, such that toward the end of the table, nothing unique can be
extracted from the short ones.

The utility of the string table is maximised by ensuring a unique character is
selected from every string, thus, we sort by length first. Note that the
uniqueness is actually determined by offset:character pairs, with the offsets
becoming the indices stored in the UniqueIndex slot. If you trace
through the diagram above, you'll see that the unique character in each slot
matches the character in the corresponding string slot, indicated by the
underlying index.
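To make the role of the first three fields more concrete, here is a scalar sketch of the negative-match test they enable. In the real routines each loop step below is done for all 16 slots at once (the comments in the struct mention vpshufb and vpcmpeqb, and the construction code mentions _mm_cmpgt_epi8 for lengths); this version just spells out the same logic one slot at a time. The struct, the function name, and the clamping of long search strings are assumptions made for this sketch, not code from the project.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOTS 16

/* Hypothetical, simplified stand-in for the first three STRING_TABLE fields. */
typedef struct {
    uint8_t unique_chars[SLOTS];  /* one distinguishing character per slot      */
    uint8_t unique_index[SLOTS];  /* offset of that character within its string */
    int8_t  lengths[SLOTS];       /* string length; 0x7f marks an unused slot   */
} table_header;

/*
 * Returns a 16-bit mask with bit i set when slot i is still a candidate:
 * the search string has the slot's unique character at the slot's unique
 * offset, and the slot's string is no longer than the search string.
 * A result of zero means no string in the table can possibly prefix match.
 */
uint16_t candidate_slots(const table_header *t,
                         const char *search,
                         size_t search_len)
{
    uint8_t first16[SLOTS] = { 0 };
    uint16_t mask = 0;
    int8_t len;

    /* Only the first 16 bytes of the search string take part. */
    memcpy(first16, search, search_len < SLOTS ? search_len : SLOTS);

    /* Assumption for this sketch: clamp below the 0x7f "unused" sentinel. */
    len = (int8_t)(search_len > 0x7e ? 0x7e : search_len);

    for (int i = 0; i < SLOTS; i++) {
        /* Shuffle step: pick the search byte at this slot's unique offset. */
        uint8_t picked = first16[t->unique_index[i] & 0x0f];

        /* Byte-equality step against the slot's unique character. */
        int char_ok = (picked == t->unique_chars[i]);

        /* Signed greater-than step: slots longer than the search string
         * (including unused slots, at 0x7f) are ruled out. */
        int len_ok = !(t->lengths[i] > len);

        if (char_ok && len_ok) {
            mask |= (uint16_t)(1u << i);
        }
    }
    return mask;
}
```

A non-zero mask only says which slots are worth comparing in full; the candidate slots still need a proper byte-by-byte (or SIMD) comparison to confirm a prefix match.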

Supporting Structures

The string array captures a raw array representation of the underlying strings
making up the string table. It is either embedded within the padding area at the
end of the string table, or a separate allocation is made during string table
creation. The main interface to creating a string table is via a STRING_ARRAY
structure. The helper functions, CreateStringTableFromDelimitedString and
CreateStringTableFromDelimitedEnvironmentVariable, simply break down their
input into a STRING_ARRAY representation first before calling
CreateStringTable.

STRING_ARRAY

typedef struct _Struct_size_bytes_(SizeInQuadwords<<3) _STRING_ARRAY {
//
// Size of the structure, in quadwords. Why quadwords? It allows us to
// keep this size field to a USHORT, which helps with the rest of the
// alignment in this struct (we want the STRING Strings[] array to start
// on an 8-byte boundary).
//
// N.B. We can't express the exact field size in the SAL annotation
// below, because the array of buffer sizes are inexpressible;
// however, we know the maximum length, so we can use the implicit
// invariant that the total buffer size can't exceed whatever num
// elements * max size is.
//
_Field_range_(<=, (
sizeof(struct _STRING_ARRAY) +
((NumberOfElements - 1) * sizeof(STRING)) +
(MaximumLength * NumberOfElements)
) >> 3)
USHORT SizeInQuadwords;
//
// Number of elements in the array.
//
USHORT NumberOfElements;
//
// Minimum and maximum lengths for the String->Length fields. Optional.
//
USHORT MinimumLength;
USHORT MaximumLength;
//
// A pointer to the STRING_TABLE structure that "owns" us.
//
struct _STRING_TABLE *StringTable;
//
// The string array. Number of elements in the array is governed by the
// NumberOfElements field above.
//
STRING Strings[ANYSIZE_ARRAY];
} STRING_ARRAY, *PSTRING_ARRAY, **PPSTRING_ARRAY;

And finally, we're left with the smaller helper structs that we use to
encapsulate the various innards of the string table. (I use unions that
feature XMMWORD representations (which is a typedef of __m128i, representing
an XMM register) as well as underlying byte/character representations as I
personally find it makes the resulting C code a bit nicer.)
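Going purely by how the fields are accessed in the code in this article (Char and CharsXmm, Index and IndexXmm, Slots and SlotsXmm), those helper unions would look roughly like the following. Treat this as a reconstruction rather than the project's actual definitions: the BYTE, CHAR and XMMWORD typedefs are local stand-ins for the Windows types so the sketch compiles on its own on an x86 target, and the real structures may carry additional members or alignment annotations.

```c
#include <emmintrin.h>  /* SSE2: __m128i */

/* Local stand-ins for the Windows-style type names used in the article. */
typedef __m128i       XMMWORD;
typedef unsigned char BYTE;
typedef char          CHAR;

/* Up to the first 16 characters of a string (also used for unique chars). */
typedef union _STRING_SLOT {
    CHAR    Char[16];
    XMMWORD CharsXmm;
} STRING_SLOT;

/* One byte offset per slot. */
typedef union _SLOT_INDEX {
    BYTE    Index[16];
    XMMWORD IndexXmm;
} SLOT_INDEX;

/* One string length per slot. */
typedef union _SLOT_LENGTHS {
    BYTE    Slots[16];
    XMMWORD SlotsXmm;
} SLOT_LENGTHS;

/* Each helper should occupy exactly one XMM register's worth of bytes. */
_Static_assert(sizeof(STRING_SLOT) == 16, "STRING_SLOT must be 16 bytes");
_Static_assert(sizeof(SLOT_INDEX) == 16, "SLOT_INDEX must be 16 bytes");
_Static_assert(sizeof(SLOT_LENGTHS) == 16, "SLOT_LENGTHS must be 16 bytes");
```

The appeal of the union is that scalar code can poke at individual bytes (for example, `Lengths.Slots[Count]`), while SIMD code can load or store the whole thing in one go via the XMM member.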

String Table Construction

The CreateSingleStringTable routine is responsible for construction of a new
STRING_TABLE. It is here we identify the unique set of characters (and their
indices) to store in the first two fields of the string table.

CreateSingleStringTable

//
// Define private types used by this module.
//
typedef struct _LENGTH_INDEX_ENTRY {
BYTE Length;
BYTE Index;
} LENGTH_INDEX_ENTRY;
typedef LENGTH_INDEX_ENTRY *PLENGTH_INDEX_ENTRY;
typedef struct _LENGTH_INDEX_TABLE {
LENGTH_INDEX_ENTRY Entry[16];
} LENGTH_INDEX_TABLE;
typedef LENGTH_INDEX_TABLE *PLENGTH_INDEX_TABLE;
typedef union DECLSPEC_ALIGN(32) _CHARACTER_BITMAP {
YMMWORD Ymm;
XMMWORD Xmm[2];
LONG Bits[(256 / (4 << 3))]; // 8
} CHARACTER_BITMAP;
C_ASSERT(sizeof(CHARACTER_BITMAP) == 32);
typedef CHARACTER_BITMAP *PCHARACTER_BITMAP;
typedef struct _SLOT_BITMAPS {
CHARACTER_BITMAP Bitmap[16];
} SLOT_BITMAPS;
typedef SLOT_BITMAPS *PSLOT_BITMAPS;
//
// Function implementation.
//
_Use_decl_annotations_
PSTRING_TABLE
CreateSingleStringTable(
PRTL Rtl,
PALLOCATOR StringTableAllocator,
PALLOCATOR StringArrayAllocator,
PSTRING_ARRAY SourceStringArray,
BOOL CopyArray
)
/*++
Routine Description:
Allocates space for a STRING_TABLE structure using the provided allocators,
then initializes it using the provided STRING_ARRAY. If CopyArray is set
to TRUE, the routine will copy the string array such that the caller is
free to destroy it after the table has been successfully created. If it
is set to FALSE and SourceStringArray->StringTable has a non-NULL value, it is
assumed that sufficient space has already been allocated for the string
table and this pointer will be used to initialize the rest of the structure.
DestroyStringTable() must be called against the returned PSTRING_TABLE when
the structure is no longer needed in order to ensure resources are released.
Arguments:
Rtl - Supplies a pointer to an initialized RTL structure.
StringTableAllocator - Supplies a pointer to an ALLOCATOR structure which
will be used for creating the STRING_TABLE.
StringArrayAllocator - Supplies a pointer to an ALLOCATOR structure which
may be used to create the STRING_ARRAY if it cannot fit within the
padding of the STRING_TABLE structure. This is kept separate from the
StringTableAllocator due to the stringent alignment requirements of the
string table.
SourceStringArray - Supplies a pointer to an initialized STRING_ARRAY structure
that contains the STRING structures that are to be added to the table.
CopyArray - Supplies a boolean value indicating whether or not the
StringArray structure should be deep-copied during creation. This is
typically set when the caller wants to be able to free the structure
as soon as this call returns (or can't guarantee it will persist past
this function's invocation, i.e. if it was stack allocated).
Return Value:
A pointer to a valid PSTRING_TABLE structure on success, NULL on failure.
Call DestroyStringTable() on the returned structure when it is no longer
needed in order to ensure resources are cleaned up appropriately.
--*/
{
BYTE Byte;
BYTE Count;
BYTE Index;
BYTE Length;
BYTE NumberOfElements;
ULONG HighestBit;
ULONG OccupiedMask;
PULONG Bits;
USHORT OccupiedBitmap;
USHORT ContinuationBitmap;
PSTRING_TABLE StringTable;
PSTRING_ARRAY StringArray;
PSTRING String;
PSTRING_SLOT Slot;
STRING_SLOT UniqueChars;
SLOT_INDEX UniqueIndex;
SLOT_INDEX LengthIndex;
SLOT_LENGTHS Lengths;
LENGTH_INDEX_TABLE LengthIndexTable;
PCHARACTER_BITMAP Bitmap;
SLOT_BITMAPS SlotBitmaps;
PLENGTH_INDEX_ENTRY Entry;
//
// Validate arguments.
//
if (!ARGUMENT_PRESENT(StringTableAllocator)) {
return NULL;
}
if (!ARGUMENT_PRESENT(StringArrayAllocator)) {
return NULL;
}
if (!ARGUMENT_PRESENT(SourceStringArray)) {
return NULL;
}
if (SourceStringArray->NumberOfElements == 0) {
return NULL;
}
//
// Copy the incoming string array if applicable.
//
if (CopyArray) {
StringArray = CopyStringArray(
StringTableAllocator,
StringArrayAllocator,
SourceStringArray,
FIELD_OFFSET(STRING_TABLE, StringArray),
sizeof(STRING_TABLE),
&StringTable
);
if (!StringArray) {
return NULL;
}
} else {
//
// We're not copying the array, so initialize StringArray to point at
// the caller's SourceStringArray, and StringTable to point at the
// array's StringTable field (which will be non-NULL if sufficient
// space has been allocated).
//
StringArray = SourceStringArray;
StringTable = StringArray->StringTable;
}
//
// If StringTable has no value, we've either been called with CopyArray set
// to FALSE, or CopyStringArray() wasn't able to allocate sufficient space
// for both the table and itself. Either way, we need to allocate space for
// the table.
//
if (!StringTable) {
StringTable = (PSTRING_TABLE)(
StringTableAllocator->AlignedCalloc(
StringTableAllocator->Context,
1,
sizeof(STRING_TABLE),
STRING_TABLE_ALIGNMENT
)
);
if (!StringTable) {
return NULL;
}
}
//
// Make sure the fields that are sensitive to alignment are, in fact,
// aligned correctly.
//
if (!AssertStringTableFieldAlignment(StringTable)) {
DestroyStringTable(StringTableAllocator,
StringArrayAllocator,
StringTable);
return NULL;
}
//
// At this point, we have copied the incoming StringArray if necessary,
// and we've allocated sufficient space for the StringTable structure.
// Enumerate over all of the strings, set the continuation bit if the
// length > 16, set the relevant slot length, set the relevant unique
// character entry, then move the first 16-bytes of the string into the
// relevant slot via an aligned SSE mov.
//
//
// Initialize pointers and counters, clear stack-based structures.
//
Slot = StringTable->Slots;
String = StringArray->Strings;
OccupiedBitmap = 0;
ContinuationBitmap = 0;
NumberOfElements = (BYTE)StringArray->NumberOfElements;
UniqueChars.CharsXmm = _mm_setzero_si128();
UniqueIndex.IndexXmm = _mm_setzero_si128();
LengthIndex.IndexXmm = _mm_setzero_si128();
//
// Set all the slot lengths to 0x7f up front instead of defaulting
// to zero. This allows for simpler logic when searching for a prefix
// string, which involves broadcasting a search string's length to an XMM
// register, then doing _mm_cmpgt_epi8() against the lengths array and
// the string length. If we left the lengths as 0 for unused slots, they
// would get included in the resulting comparison register (i.e. the high
// bits would be set to 1), so we'd have to do a subsequent masking of
// the result at some point using the OccupiedBitmap. By defaulting the
// lengths to 0x7f, we ensure they'll never get included in any cmpgt-type
// SIMD matches. (We use 0x7f instead of 0xff because the _mm_cmpgt_epi8()
// intrinsic assumes packed signed integers.)
//
Lengths.SlotsXmm = _mm_set1_epi8(0x7f);
ZeroStruct(LengthIndexTable);
ZeroStruct(SlotBitmaps);
for (Count = 0; Count < NumberOfElements; Count++) {
XMMWORD CharsXmm;
//
// Set the string length for the slot.
//
Length = Lengths.Slots[Count] = (BYTE)String->Length;
//
// Set the appropriate bit in the continuation bitmap if the string is
// longer than 16 bytes.
//
if (Length > 16) {
ContinuationBitmap |= (USHORT)(1 << Count);
}
if (Count == 0) {
Entry = &LengthIndexTable.Entry[0];
Entry->Index = 0;
Entry->Length = Length;
} else {
//
// Perform a linear scan of the length-index table in order to
// identify an appropriate insertion point.
//
for (Index = 0; Index < Count; Index++) {
if (Length < LengthIndexTable.Entry[Index].Length) {
break;
}
}
if (Index != Count) {
//
// New entry doesn't go at the end of the table, so shuffle
// everything else down.
//
Rtl->RtlMoveMemory(&LengthIndexTable.Entry[Index + 1],
&LengthIndexTable.Entry[Index],
(Count - Index) * sizeof(*Entry));
}
Entry = &LengthIndexTable.Entry[Index];
Entry->Index = Count;
Entry->Length = Length;
}
//
// Copy the first 16-bytes of the string into the relevant slot. We
// have taken care to ensure everything is 16-byte aligned by this
// stage, so we can use SSE intrinsics here.
//
CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);
_mm_store_si128(&(*Slot).CharsXmm, CharsXmm);
//
// Advance our pointers.
//
++Slot;
++String;
}
//
// Store the slot lengths.
//
_mm_store_si128(&(StringTable->Lengths.SlotsXmm), Lengths.SlotsXmm);
//
// Loop through the strings in order of shortest to longest and construct
// the uniquely-identifying character table with corresponding index.
//
for (Count = 0; Count < NumberOfElements; Count++) {
Entry = &LengthIndexTable.Entry[Count];
Length = Entry->Length;
Slot = &StringTable->Slots[Entry->Index];
//
// Iterate over each character in the slot and find the first one
// without a corresponding bit set.
//
for (Index = 0; Index < Length; Index++) {
Bitmap = &SlotBitmaps.Bitmap[Index];
Bits = (PULONG)&Bitmap->Bits[0];
Byte = Slot->Char[Index];
if (!BitTestAndSet(Bits, Byte)) {
break;
}
}
UniqueChars.Char[Count] = Byte;
UniqueIndex.Index[Count] = Index;
LengthIndex.Index[Count] = Entry->Index;
}
//
// Loop through the elements again such that the unique chars are stored
// in the order they appear in the table.
//
for (Count = 0; Count < NumberOfElements; Count++) {
for (Index = 0; Index < NumberOfElements; Index++) {
if (LengthIndex.Index[Index] == Count) {
StringTable->UniqueChars.Char[Count] = UniqueChars.Char[Index];
StringTable->UniqueIndex.Index[Count] = UniqueIndex.Index[Index];
break;
}
}
}
//
// Generate and store the occupied bitmap. Each bit, from low to high,
// corresponds to the index of a slot. When set, the slot is occupied.
// When clear, it is not. So, fill bits from the highest bit set down.
//
HighestBit = (1 << (StringArray->NumberOfElements-1));
OccupiedMask = _blsmsk_u32(HighestBit);
StringTable->OccupiedBitmap = (USHORT)OccupiedMask;
//
// Store the continuation bitmap.
//
StringTable->ContinuationBitmap = (USHORT)(ContinuationBitmap);
//
// Wire up the string array to the table.
//
StringTable->pStringArray = StringArray;
//
// And we're done, return the table.
//
return StringTable;
}
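The occupied bitmap calculation at the end of that routine is worth a quick illustration. The _blsmsk_u32 intrinsic sets every bit up to and including the lowest set bit of its input, which is the same as `x ^ (x - 1)`, so feeding it the bit for the last occupied slot yields a mask covering all occupied slots. The sketch below uses that portable equivalent; the function names are made up for this example.

```c
#include <stdint.h>

/* Portable equivalent of the BMI1 _blsmsk_u32 intrinsic:
 * all bits up to and including the lowest set bit of x. */
uint32_t blsmsk_u32_portable(uint32_t x)
{
    return x ^ (x - 1);
}

/*
 * Hypothetical helper: bitmap with one bit set per occupied slot, assuming
 * slots are filled from index 0 upward and 1 <= number_of_elements <= 16.
 */
uint16_t occupied_bitmap(unsigned number_of_elements)
{
    uint32_t highest_bit = 1u << (number_of_elements - 1);
    return (uint16_t)blsmsk_u32_portable(highest_bit);
}
```

For example, a table with three strings gets the bitmap 0x0007, and a full table of 16 strings gets 0xFFFF.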

The Benchmark

The performance comparison graphs in the subsequent sections were generated in
Excel, using CSV data output by the creatively-named program
StringTable2BenchmarkExe.

Modern CPUs are fast and timing is hard, especially when you're dealing with CPU
cycle comparisons. No approach is perfect. Here's what I settled on:

The benchmark utility has #pragma optimize("", off) at the
start of the file, which disables global optimizations, even in release
(optimized) builds. This prevents the compiler doing clever things with
regards to scheduling of the timestamping logic, which affects reported
times.

The benchmark utility pins itself to a single core and sets its thread
priority to the highest permissible value at startup. (Turbo is
disabled on the computer, such that the frequency is pinned to 3.68GHz.)

The benchmark utility is fed an array of function pointers and test
inputs. It iterates over each test input, and then iterates over
each function, calling it with the test input and potentially verifying
the result (some functions are included for comparison purposes but
don't actually produce correct results, and thus, do not have their
results verified).

The test input string is copied into a local buffer that is aligned on a
32 byte boundary. This ensures that all test inputs are being compared
fairly. (The natural alignment of the buffers varies anywhere from 2 to
512 bytes, and unaligned buffers have a significant impact on the timings.)

The function is run once, with the result captured. If verification has
been requested, the result is verified. We __debugbreak()
immediately if there's a mismatch, which is handy during development.

NtDelayExecution(TRUE, 1) is called, which results in a sleep of
approximately 100 nanoseconds. This is done to force a context switch,
such that the thread gets a new scheduling quantum before each function
is run.

The function is executed 100 times for warmup.

Timings are taken for 1000 iterations of the function using the given
test input. The __rdtscp() intrinsic is used (which forces
some serialization) to capture the timestamp counter before and after the
iterations.

This process is repeated 100 times. The minimum time observed to
perform 1000 iterations (out of 100 attempts) is captured as the
function's best time.
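
The core of that procedure (aligned copy, warmup, then best-of-N timing) can
be sketched roughly as follows. This is a simplified model, not the actual
benchmark utility: the names best_time, candidate_fn and timestamp_fn are made
up for illustration, the clock is passed in as a function pointer (on the real
system it would wrap __rdtscp()), and the thread pinning, priority changes and
NtDelayExecution() call are left out. A fake clock is included purely so the
sketch can be exercised deterministically.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t (*timestamp_fn)(void);          /* hypothetical clock hook */
typedef int (*candidate_fn)(const char *input);  /* hypothetical signature  */

/* Returns the lowest tick count observed for `iterations` calls of fn. */
uint64_t best_time(candidate_fn fn, const char *input, timestamp_fn now,
                   unsigned warmup, unsigned iterations, unsigned attempts)
{
    /* Copy the input into a 32-byte aligned local buffer so that every
       candidate function sees identically aligned data. */
    alignas(32) char buffer[64] = {0};
    size_t length = strlen(input);
    uint64_t best = UINT64_MAX;
    unsigned i, attempt;

    assert(length < sizeof(buffer));
    memcpy(buffer, input, length);

    for (i = 0; i < warmup; i++) {
        fn(buffer);
    }

    for (attempt = 0; attempt < attempts; attempt++) {
        uint64_t start = now();
        uint64_t elapsed;
        for (i = 0; i < iterations; i++) {
            fn(buffer);
        }
        elapsed = now() - start;
        if (elapsed < best) {
            best = elapsed;   /* keep the minimum, not the mean */
        }
    }
    return best;
}

/* Deterministic stand-ins so the sketch runs without a real cycle counter. */
static uint64_t fake_ticks;
static unsigned fake_calls;

static uint64_t fake_now(void) { return fake_ticks; }

static int fake_candidate(const char *input)
{
    (void)input;
    /* Pretend the first 250 calls are "cold" (9 ticks), later ones 5. */
    fake_ticks += (fake_calls++ < 250) ? 9 : 5;
    return 0;
}

uint64_t demo_best_time(void)
{
    fake_ticks = 0;
    fake_calls = 0;
    return best_time(fake_candidate, "$Bitmap", fake_now, 100, 1000, 100);
}
```

Taking the minimum rather than the average is what lets the slow first
attempt (still partly cold after warmup in this fake setup) drop out of the
reported number.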

Release vs PGO Oddities

All of the times in the graphs come from the profile-guided optimization build
of the StringTable component. The PGO build is faster than the normal release
build in every case, except one, where it is notably slower.

It's... odd. I haven't investigated it. The following graph shows the
affected function, IsPrefixOfStringInTable_1, and a few other versions for
reference, comparing the performance of the PGO build to the release build on
the input strings "$INDEX_ALLOCATION" and "$Bai123456789012".

Only that function is affected, and the problem really only manifests on
the two example test strings depicted. As this routine essentially serves
as one of the initial baseline implementations, it would be misleading to
compare all of our optimized PGO versions to the abnormally-slow baseline
implementation. So, the release and PGO timings were blended together into
a single CSV, and the Excel PivotTables pick whatever the minimum time is
for a given function and test input.

Thus, you're always looking at the PGO timings, except for this outlier case
where the release versions are faster.

The Implementations

Round 1

In this section, we'll take a look at the various implementations I experimented
with on the first pass, prior to soliciting any feedback. I figured there were
a couple of ways I could present this information. First, I could hand-pick
what I choose to show and hide, such that a nice rosy picture is presented that
makes it seem like I arrived at the fastest implementation without much actual
effort whatsoever.

Or I could show the gritty reality of how everything actually went
down in a chronological fashion, errors and all. And there were definitely some
errors! For better or for worse, I've chosen to go down this route, so you'll
get to enjoy some pretty tedious tweaks (changing a single line, for example)
before the juicy stuff really kicks in.

Additionally, with the benefit of writing this little section introduction
retroactively, I can say that iterations 4 and 5 aren't testing what I initially
thought they were testing. I've left them in as is; if anything, it demonstrates
the importance of only changing one thing at a time, and making sure you're
testing what you think you're testing. I'll discuss the errors with those
iterations later in the article.

IsPrefixOfCStrInArray

Let's review the baseline implementation again, as that's what we're ultimately
comparing ourselves against. This version enumerates the string array (and thus
has a slightly different function signature to the STRING_TABLE-based functions)
looking for prefix matches. No SIMD instructions are used. The timings
captured should be proportional to the location of the test input string in the
array. That is, it should take less time to prefix match strings that occur
earlier in the array versus those that appear later.
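
As a reminder of the general shape of that baseline, here is a minimal sketch
of a byte-by-byte prefix search over an array of NUL-terminated strings. It is
not the article's actual IsPrefixOfCStrInArray source; the function name, the
NULL-terminated array convention, the -1 sentinel and the sample names are all
assumptions made for illustration.

```c
#include <stddef.h>

/* Sample data, made up for this sketch. */
static const char *const sample_names[] = {
    "$AttrDef", "$BadClus", "$Bitmap", "$Boot", NULL
};

/* Returns the index of the first table entry that is a prefix of `search`,
   or -1 if none is. The table is terminated by a NULL pointer. */
int is_prefix_of_cstr_in_array(const char *const *table, const char *search)
{
    int index;

    for (index = 0; table[index] != NULL; index++) {
        const char *target = table[index];
        const char *cursor = search;

        /* Walk both strings until they differ or the table entry ends. */
        while (*target != '\0' && *target == *cursor) {
            target++;
            cursor++;
        }
        if (*target == '\0') {
            return index;   /* consumed the whole entry: prefix match */
        }
    }
    return -1;
}

int find_in_sample(const char *search)
{
    return is_prefix_of_cstr_in_array(sample_names, search);
}
```

The cost grows with the position of the matching entry, which is why the
timings for this routine track the order of the strings in the array.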

IsPrefixOfStringInTable_1

This version is similar to the IsPrefixOfCStrInArray
implementation, except it utilizes the slot length information provided by the
STRING_ARRAY structure, and conforms to our standard
IsPrefixOfStringInTable function signature. It uses no SIMD
instructions.

That's an interesting result! Even without using any SIMD instructions, version
1, the IsPrefixOfStringInTable_1 routine, is faster (in all but one case)
than the baseline IsPrefixOfCStrInArray routine, thanks to a more
sophisticated data structure.

(And really, it's not even using the sophisticated parts of the
STRING_TABLE; it's just leveraging the fact that we've captured the
lengths of each string in the backing STRING_ARRAY structure, by
virtue of the fact that we use the STRING structure to wrap our
strings (versus relying on the standard NULL-terminated C string approach).)
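
To illustrate why knowing the lengths up front helps, here is a rough sketch
of the same linear search over counted strings. The counted_string type is a
hypothetical stand-in for STRING, and the function and sample table are
likewise invented for this example. An entry that is longer than the search
string is rejected with a single integer comparison before any bytes are
touched.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned short length;   /* number of bytes in buffer, no terminator */
    const char *buffer;
} counted_string;            /* hypothetical stand-in for STRING */

#define COUNTED(literal) { (unsigned short)(sizeof(literal) - 1), (literal) }

/* Sample data, made up for this sketch. */
static const counted_string sample_table[] = {
    COUNTED("$AttrDef"), COUNTED("$BadClus"), COUNTED("$Bitmap"), COUNTED("$Boot"),
};

int is_prefix_of_counted_string(const counted_string *table, size_t count,
                                const char *search, unsigned short search_length)
{
    size_t index;

    for (index = 0; index < count; index++) {
        if (table[index].length > search_length) {
            continue;   /* too long to be a prefix of the search string */
        }
        if (memcmp(table[index].buffer, search, table[index].length) == 0) {
            return (int)index;
        }
    }
    return -1;
}

int find_counted(const char *search)
{
    return is_prefix_of_counted_string(
        sample_table, sizeof(sample_table) / sizeof(sample_table[0]),
        search, (unsigned short)strlen(search));
}
```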

IsPrefixOfStringInTable_2

This version is the first of the routines to use SIMD instructions. It is
actually based on the prefix matching routine I wrote for the first version
of the StringTable component back in 2016. The layout of the STRING_TABLE
struct differed in the first version; only the first character of each slot
was used to do the initial exclusion (as opposed to the unique character),
and lengths were unsigned shorts instead of chars (16 bits instead of 8 bits),
so the match bitmap had to be constructed slightly differently.

None of those details really apply to our second attempt at the StringTable
component, detailed in this article. Our lengths are 8 bits, and we use unique
characters in the initial negative match fast-path. However, the first version
used an elaborate AVX2 prefix match routine that was geared toward matching
long strings, and attempted to use non-temporal streaming load instructions
where possible (which would only make sense for a large number of long strings
in a very small set of cache-thrashing scenarios).
Compare our simpler implementation, IsPrefixMatch, which we use in
version 3 onward, to the far more elaborate (and unnecessary)
IsPrefixMatchAvx2:

FORCEINLINE
USHORT
IsPrefixMatchAvx2(
_In_ PCSTRING SearchString,
_In_ PCSTRING TargetString,
_In_ USHORT Offset
)
{
USHORT SearchStringRemaining;
USHORT TargetStringRemaining;
ULONGLONG SearchStringAlignment;
ULONGLONG TargetStringAlignment;
USHORT CharactersMatched = Offset;
LONG Count;
LONG Mask;
PCHAR SearchBuffer;
PCHAR TargetBuffer;
STRING_SLOT SearchSlot;
XMMWORD SearchXmm;
XMMWORD TargetXmm;
XMMWORD ResultXmm;
YMMWORD SearchYmm;
YMMWORD TargetYmm;
YMMWORD ResultYmm;
SearchStringRemaining = SearchString->Length - Offset;
TargetStringRemaining = TargetString->Length - Offset;
SearchBuffer = (PCHAR)RtlOffsetToPointer(SearchString->Buffer, Offset);
TargetBuffer = (PCHAR)RtlOffsetToPointer(TargetString->Buffer, Offset);
//
// This routine is only called in the final stage of a prefix match when
// we've already verified the slot's corresponding original string length
// (referred in this routine as the target string) is less than or equal
// to the length of the search string.
//
// We attempt as many 32-byte comparisons as we can, then as many 16-byte
// comparisons as we can, then a final < 16-byte comparison if necessary.
//
// We use aligned loads if possible, falling back to unaligned if not.
//
StartYmm:
if (SearchStringRemaining >= 32 && TargetStringRemaining >= 32) {
//
// We have at least 32 bytes to compare for each string. Check the
// alignment for each buffer and do an aligned streaming load (non-
// temporal hint) if our alignment is at a 32-byte boundary or better;
// reverting to an unaligned load when not.
//
SearchStringAlignment = GetAddressAlignment(SearchBuffer);
TargetStringAlignment = GetAddressAlignment(TargetBuffer);
if (SearchStringAlignment < 32) {
SearchYmm = _mm256_loadu_si256((PYMMWORD)SearchBuffer);
} else {
SearchYmm = _mm256_stream_load_si256((PYMMWORD)SearchBuffer);
}
if (TargetStringAlignment < 32) {
TargetYmm = _mm256_loadu_si256((PYMMWORD)TargetBuffer);
} else {
TargetYmm = _mm256_stream_load_si256((PYMMWORD)TargetBuffer);
}
//
// Compare the two vectors.
//
ResultYmm = _mm256_cmpeq_epi8(SearchYmm, TargetYmm);
//
// Generate a mask from the result of the comparison.
//
Mask = _mm256_movemask_epi8(ResultYmm);
//
// There were at least 32 characters remaining in each string buffer,
// thus, every character needs to have matched in order for this search
// to continue. If there were less than 32 characters, we can terminate
// this prefix search here. (-1 == 0xffffffff == all bits set == all
// characters matched.)
//
if (Mask != -1) {
//
// Not all characters were matched, terminate the prefix search.
//
return NO_MATCH_FOUND;
}
//
// All 32 characters were matched. Update counters and pointers
// accordingly and jump back to the start of the 32-byte processing.
//
SearchStringRemaining -= 32;
TargetStringRemaining -= 32;
CharactersMatched += 32;
SearchBuffer += 32;
TargetBuffer += 32;
goto StartYmm;
}
//
// Intentional follow-on to StartXmm.
//
StartXmm:
//
// Update the search string's alignment.
//
if (SearchStringRemaining >= 16 && TargetStringRemaining >= 16) {
//
// We have at least 16 bytes to compare for each string. Check the
// alignment for each buffer and do an aligned streaming load (non-
// temporal hint) if our alignment is at a 16-byte boundary or better;
// reverting to an unaligned load when not.
//
SearchStringAlignment = GetAddressAlignment(SearchBuffer);
if (SearchStringAlignment < 16) {
SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);
} else {
SearchXmm = _mm_stream_load_si128((XMMWORD *)SearchBuffer);
}
TargetXmm = _mm_stream_load_si128((XMMWORD *)TargetBuffer);
//
// Compare the two vectors.
//
ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);
//
// Generate a mask from the result of the comparison.
//
Mask = _mm_movemask_epi8(ResultXmm);
//
// There were at least 16 characters remaining in each string buffer,
// thus, every character needs to have matched in order for this search
// to continue. If there were less than 16 characters, we can terminate
// this prefix search here. (-1 == 0xffff -> all bits set -> all chars
// matched.)
//
if ((SHORT)Mask != (SHORT)-1) {
//
// Not all characters were matched, terminate the prefix search.
//
return NO_MATCH_FOUND;
}
//
// All 16 characters were matched. Update counters and pointers
// accordingly and jump back to the start of the 16-byte processing.
//
SearchStringRemaining -= 16;
TargetStringRemaining -= 16;
CharactersMatched += 16;
SearchBuffer += 16;
TargetBuffer += 16;
goto StartXmm;
}
if (TargetStringRemaining == 0) {
//
// We'll get here if we successfully prefix matched the search string
// and all our buffers were aligned (i.e. we don't have a trailing
// < 16 bytes comparison to perform).
//
return CharactersMatched;
}
//
// If we get here, we have less than 16 bytes to compare. Our target
// strings are guaranteed to be 16-byte aligned, so we can load them
// using an aligned stream load as in the previous cases.
//
TargetXmm = _mm_stream_load_si128((PXMMWORD)TargetBuffer);
//
// Loading the remainder of our search string's buffer is a little more
// complicated. It could reside within 15 bytes of the end of the page
// boundary, which would mean that a 128-bit load would cross a page
// boundary.
//
// At best, the page will belong to our process and we'll take a performance
// hit. At worst, we won't own the page, and we'll end up triggering a hard
// page fault.
//
// So, see if the current search buffer address plus 16 bytes crosses a page
// boundary. If it does, take the safe but slower approach of a ranged
// memcpy (movsb) into a local stack-allocated STRING_SLOT structure.
//
if (!PointerToOffsetCrossesPageBoundary(SearchBuffer, 16)) {
//
// No page boundary is crossed, so just do an unaligned 128-bit move
// into our Xmm register. (We could do the aligned/unaligned dance
// here, but it's the last load we'll be doing (i.e. it's not
// potentially on a loop path), so I don't think it's worth the extra
// branch cost, although I haven't measured this empirically.)
//
SearchXmm = _mm_loadu_si128((XMMWORD *)SearchBuffer);
} else {
//
// We cross a page boundary, so only copy the bytes we need via
// __movsb(), then do an aligned stream load into the Xmm register
// we'll use in the comparison.
//
__movsb((PBYTE)&SearchSlot.Char,
(PBYTE)SearchBuffer,
SearchStringRemaining);
SearchXmm = _mm_stream_load_si128(&SearchSlot.CharsXmm);
}
//
// Compare the final vectors.
//
ResultXmm = _mm_cmpeq_epi8(SearchXmm, TargetXmm);
//
// Generate a mask from the result of the comparison, but mask off (zero
// out) high bits from the target string's remaining length.
//
Mask = _bzhi_u32(_mm_movemask_epi8(ResultXmm), TargetStringRemaining);
//
// Count how many characters were matched and determine if we were a
// successful prefix match or not.
//
Count = __popcnt(Mask);
if ((USHORT)Count == TargetStringRemaining) {
//
// If we matched the same amount of characters as remaining in the
// target string, we've successfully prefix matched the search string.
// Return the total number of characters we matched.
//
CharactersMatched += (USHORT)Count;
return CharactersMatched;
}
//
// After all that work, our string match failed at the final stage! Return
// to the caller indicating we were unable to make a prefix match.
//
return NO_MATCH_FOUND;
}
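
The routine above leans on two helpers whose bodies aren't shown,
GetAddressAlignment and PointerToOffsetCrossesPageBoundary. A plausible way
to write equivalents is sketched below; these are my own illustrative
versions with made-up names, not the originals, and the 4096-byte page size is
an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ASSUMED_PAGE_SIZE ((uintptr_t)4096)   /* assumption: 4 KB pages */

/* Largest power of two that divides the address (0 for address 0). */
uintptr_t address_alignment(uintptr_t address)
{
    return address & (~address + 1);   /* isolates the lowest set bit */
}

/* True if reading `bytes` bytes starting at `address` touches two pages. */
bool read_crosses_page_boundary(uintptr_t address, size_t bytes)
{
    uintptr_t first_page, last_page;

    assert(bytes > 0);
    first_page = address / ASSUMED_PAGE_SIZE;
    last_page = (address + bytes - 1) / ASSUMED_PAGE_SIZE;
    return first_page != last_page;
}
```

With these definitions, a 16-byte load starting 16 bytes before the end of a
page is still safe, whereas one starting 15 bytes before the end is not.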

The AVX2 routine is overkill, especially considering the emphasis we put on
favoring short strings versus longer ones in the requirements section. But
we want to put broad statements like that to the test, so let's include it
as our first SIMD implementation so that we can see how it stacks up against
the simpler versions.
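
For contrast with the AVX2 routine, a simple byte-wise prefix match over the
remaining bytes could look something like the following. This is only a
sketch of the idea under my own naming (is_prefix_match_simple, NO_MATCH); it
is not the article's IsPrefixMatch source. Like the AVX2 version, it assumes
the first `offset` bytes have already been verified by the caller and returns
the total number of characters matched.

```c
#include <stddef.h>

#define NO_MATCH (-1)   /* hypothetical sentinel for this sketch */

int is_prefix_match_simple(const char *search, size_t search_length,
                           const char *target, size_t target_length,
                           size_t offset)
{
    size_t index;

    /* The target can only be a prefix if it is no longer than the search. */
    if (target_length > search_length || offset > target_length) {
        return NO_MATCH;
    }

    /* Bytes before `offset` are assumed to have matched already. */
    for (index = offset; index < target_length; index++) {
        if (search[index] != target[index]) {
            return NO_MATCH;
        }
    }
    return (int)target_length;
}
```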

Also note, this is the first time we're seeing the full body of the SIMD-style
IsPrefixOfStringInTable implementation. It's commented heavily,
and, in general, the core algorithm doesn't fundamentally change across
iterations (things are just tweaked slightly), so I'd recommend reading through
it thoroughly to build up a mental model of how the matching algorithm works.
It's pretty straightforward, and the subsequent iterations will make a lot more
sense as they're typically presented as diffs against the previous version
first.
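
If it helps with that mental model, the up-front filtering step can be
modelled in plain scalar C, one slot (vector lane) at a time. This is an
illustrative model only: table_model is a cut-down, hypothetical stand-in for
STRING_TABLE, the sample table contents are invented, and the comments name
the instruction each line is meant to imitate.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 16

typedef struct {
    uint8_t unique_chars[SLOT_COUNT];  /* distinguishing character per slot */
    uint8_t unique_index[SLOT_COUNT];  /* byte offset of that character     */
    uint8_t lengths[SLOT_COUNT];       /* string length; 0x7f marks empty   */
} table_model;                         /* hypothetical, for this sketch     */

/* Three made-up entries: "$Boot", "$Bitmap", "$AttrDef"; the rest empty. */
static const table_model sample = {
    .unique_chars = { 'o', 'i', 'A' },
    .unique_index = { 2, 2, 1 },
    .lengths = { 5, 7, 8, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
                 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f },
};

/* Returns a bitmap with bit N set if slot N is still a candidate. */
uint32_t candidate_bitmap(const table_model *table,
                          const uint8_t search[SLOT_COUNT],
                          uint8_t search_length)
{
    uint32_t bitmap = 0;
    int slot;

    for (slot = 0; slot < SLOT_COUNT; slot++) {
        /* pshufb: pick the search byte at this slot's unique index. */
        uint8_t picked = search[table->unique_index[slot] & 0x0f];
        /* pcmpeqb: does it equal the slot's unique character? */
        int same_char = (picked == table->unique_chars[slot]);
        /* pcmpgtb (signed) then inversion: slot no longer than search. */
        int fits = !((int8_t)table->lengths[slot] > (int8_t)search_length);

        if (same_char && fits) {
            bitmap |= 1u << slot;   /* pand + pmovmskb */
        }
    }
    return bitmap;
}

uint32_t demo_bitmap(const char *text)
{
    uint8_t search[SLOT_COUNT] = {0};
    size_t length = strlen(text);

    memcpy(search, text, length < SLOT_COUNT ? length : SLOT_COUNT);
    return candidate_bitmap(&sample, search, (uint8_t)length);
}
```

Note that a set bit only means the slot survived the cheap filter. In this
sample, "$Book" still yields a candidate for the "$Boot" slot, and it is the
full 16-byte comparison later on that rejects it.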

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_2(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether the search string "starts with or is
equal to" any string in the table.
This is our first AVX-optimized version of the routine.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable->pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray->MinimumLength > String->Length) {
goto NoMatch;
}
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 && Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatchAvx2(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}
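
The tzcnt-and-shift bookkeeping in the do/while loop above is easy to get
wrong, so here it is in isolation as a small portable sketch. The helper names
are mine, the trailing-zero count is a plain loop standing in for
_tzcnt_u32(), and the loop runs until the bitmap is empty instead of counting
down from a popcount as the listing does.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32(); value must be nonzero. */
static unsigned count_trailing_zeros(uint32_t value)
{
    unsigned zeros = 0;

    while ((value & 1u) == 0) {
        value >>= 1;
        zeros++;
    }
    return zeros;
}

/* Returns the slot index of the n-th (0-based) set bit in a 16-bit slot
   bitmap, or -1 if there are fewer than n + 1 set bits. */
int nth_candidate_slot(uint32_t bitmap, unsigned n)
{
    unsigned shift = 0;

    assert(bitmap <= 0xffffu);   /* one bit per slot, 16 slots */

    while (bitmap != 0) {
        unsigned zeros = count_trailing_zeros(bitmap);
        unsigned index = zeros + shift;   /* position in the original bitmap */

        if (n == 0) {
            return (int)index;
        }
        n--;

        /* Shift past the zeros and the 1 just found, and remember how far
           we have moved so the next index is still absolute. */
        bitmap >>= (zeros + 1);
        shift = index + 1;
    }
    return -1;
}
```

For a bitmap of 0x8005 (slots 0, 2 and 15), the walk visits the slots in
ascending order, which matches the table being sorted by length.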

Let's see how version 2, our first SIMD attempt, performs in comparison to the
two baselines.

Eek! Our first SIMD attempt actually has worse prefix matching performance in
most cases! The only area where it shows a performance improvement is negative
matching.

IsPrefixOfStringInTable_3

This version is identical to version 2 except for one line: the call to
IsPrefixMatchAvx2 for oversized slots is replaced with a call to the simpler
IsPrefixMatch routine.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_3(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether the search string "starts with or is
equal to" any string in the table.
This version replaces the IsPrefixMatchAvx2 call with the simpler IsPrefixMatch.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable->pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray->MinimumLength > String->Length) {
goto NoMatch;
}
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
LoadSearchStringIntoXmmRegister(Search, String, SearchLength);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 && Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}
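
The per-slot check inside the loop (compare 16 bytes, turn the result into a
bitmask, decide whether enough leading bytes matched) can also be tried out on
its own. The sketch below uses SSE2 intrinsics where available and falls back
to a scalar loop elsewhere. Two things differ from the listing and are my own
choices: the function names are invented, and instead of _bzhi_u32() plus a
popcount it tests that the low slot_length bits of the mask are all set. It
only covers slots of 16 bytes or fewer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

/* Bit i of the result is set when slot[i] == search[i]. */
static uint32_t equal_byte_mask(const uint8_t slot[16], const uint8_t search[16])
{
#ifdef SKETCH_HAVE_SSE2
    __m128i a = _mm_loadu_si128((const __m128i *)slot);
    __m128i b = _mm_loadu_si128((const __m128i *)search);
    return (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(a, b));
#else
    uint32_t mask = 0;
    int i;
    for (i = 0; i < 16; i++) {
        if (slot[i] == search[i]) {
            mask |= 1u << i;
        }
    }
    return mask;
#endif
}

/* Returns 1 if the slot's string is a prefix of the search string. Both
   buffers are 16 bytes, zero padded past their lengths. */
int slot_is_prefix(const uint8_t slot[16], unsigned slot_length,
                   const uint8_t search[16], unsigned search_length)
{
    uint32_t required;

    assert(slot_length <= 16);
    if (slot_length > search_length) {
        return 0;
    }
    required = (1u << slot_length) - 1u;   /* low slot_length bits */
    return (equal_byte_mask(slot, search) & required) == required;
}

int demo_slot_match(const char *slot_text, const char *search_text)
{
    uint8_t slot[16] = {0};
    uint8_t search[16] = {0};
    size_t slot_length = strlen(slot_text);
    size_t search_length = strlen(search_text);

    memcpy(slot, slot_text, slot_length < 16 ? slot_length : 16);
    memcpy(search, search_text, search_length < 16 ? search_length : 16);
    return slot_is_prefix(slot, (unsigned)slot_length,
                          search, (unsigned)search_length);
}
```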

Phew! We finally see superior performance across the board. This ends the
short-lived tenure of version 2, which is demonstrably worse in every case.
We'll also omit the IsPrefixOfCStrInArray routine from the graphs
for now (for the most part), as it has served its initial baseline purpose.

IsPrefixOfStringInTable_4

When I first wrote the initial string table code, I was playing around with
different strategies for loading the initial search string buffer. That
resulted in the file
StringLoadStoreOperations.h, which defined a bunch of helper macros.
I've included them below, but don't spend too much time absorbing them: they're
not good practice, and they're all irrelevant anyway as soon as we switch to
_mm_loadu_si128() in a few versions. I'm including them because
they set the scene for versions 4, 5 and 6.

/*++
VOID
LoadSearchStringIntoXmmRegister_SEH(
_In_ STRING_SLOT Slot,
_In_ PSTRING String,
_In_ USHORT LengthVar
);
Routine Description:
Attempts an aligned 128-bit load of String->Buffer into Slot.CharsXmm via
the _mm_load_si128() intrinsic. The intrinsic is surrounded in a __try/
__except block that catches EXCEPTION_ACCESS_VIOLATION exceptions.
If such an exception is caught, the routine will check to see if the string
buffer's address will cross a page boundary if 16-bytes are loaded. If a
page boundary would be crossed, a __movsb() intrinsic is used to copy only
the bytes specified by String->Length, otherwise, an unaligned 128-bit load
is attempted via the _mm_loadu_si128() intrinsic.
Arguments:
Slot - Supplies the STRING_SLOT local variable name within the calling
function that will receive the results of the load operation.
String - Supplies the name of the PSTRING variable that is to be loaded
into the slot. This will usually be one of the function parameters.
LengthVar - Supplies the name of a USHORT local variable that will receive
the value of min(String->Length, 16).
Return Value:
None.
--*/
#define LoadSearchStringIntoXmmRegister_SEH(Slot, String, LengthVar) \
LengthVar = min(String->Length, 16); \
TRY_SSE42_ALIGNED { \
Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer); \
} CATCH_EXCEPTION_ACCESS_VIOLATION { \
if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) { \
__movsb(Slot.Char, String->Buffer, LengthVar); \
} else { \
Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer); \
} \
}
/*++
VOID
LoadSearchStringIntoXmmRegister_AlignmentCheck(
_In_ STRING_SLOT Slot,
_In_ PSTRING String,
_In_ USHORT LengthVar
);
Routine Description:
This routine checks to see if a page boundary will be crossed if 16-bytes
are loaded from the address supplied by String->Buffer. If a page boundary
will be crossed, a __movsb() intrinsic is used to only copy String->Length
bytes into the given Slot.
If no page boundary will be crossed by a 128-bit load, the alignment of
the address supplied by String->Buffer is checked. If the alignment isn't
at least on a 16-byte boundary, an unaligned load will be issued via the
_mm_loadu_si128() intrinsic, otherwise, an _mm_load_si128() will be used.
Arguments:
Slot - Supplies the STRING_SLOT local variable name within the calling
function that will receive the results of the load operation.
String - Supplies the name of the PSTRING variable that is to be loaded
into the slot. This will usually be one of the function parameters.
LengthVar - Supplies the name of a USHORT local variable that will receive
the value of min(String->Length, 16).
Return Value:
None.
--*/
#define LoadSearchStringIntoXmmRegister_AlignmentCheck(Slot, String, LengthVar) \
LengthVar = min(String->Length, 16); \
if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) { \
__movsb(Slot.Char, String->Buffer, LengthVar); \
} else if (GetAddressAlignment(String->Buffer) < 16) { \
Slot.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer); \
} else { \
Slot.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer); \
}
/*++
VOID
LoadSearchStringIntoXmmRegister_AlwaysUnaligned(
_In_ STRING_SLOT Slot,
_In_ PSTRING String,
_In_ USHORT LengthVar
);
Routine Description:
This routine performs an unaligned 128-bit load of the address supplied by
String->Buffer into the given Slot via the _mm_loadu_si128() intrinsic.
No checks are done regarding whether or not a page boundary will be crossed.
Arguments:
Slot - Supplies the STRING_SLOT local variable name within the calling
function that will receive the results of the load operation.
String - Supplies the name of the PSTRING variable that is to be loaded
into the slot. This will usually be one of the function parameters.
LengthVar - Supplies the name of a USHORT local variable that will receive
the value of min(String->Length, 16).
Return Value:
None.
--*/
#define LoadSearchStringIntoXmmRegister_Unaligned(Slot, String, LengthVar) \
LengthVar = min(String->Length, 16); \
if (PointerToOffsetCrossesPageBoundary(String->Buffer, 16)) { \
__movsb(Slot.Char, String->Buffer, LengthVar); \
} else if (GetAddressAlignment(String->Buffer) < 16) { \
Slot.CharsXmm = _mm_loadu_si128(String->Buffer); \
} else { \
Slot.CharsXmm = _mm_load_si128(String->Buffer); \
}
/*++
VOID
LoadSearchStringIntoXmmRegister_AlwaysMovsb(
_In_ STRING_SLOT Slot,
_In_ PSTRING String,
_In_ USHORT LengthVar
);
Routine Description:
This routine copies min(String->Length, 16) bytes from String->Buffer
into the given Slot via the __movsb() intrinsic. The memory referenced by
the Slot is not cleared first via SecureZeroMemory().
Arguments:
Slot - Supplies the STRING_SLOT local variable name within the calling
function that will receive the results of the load operation.
String - Supplies the name of the PSTRING variable that is to be loaded
into the slot. This will usually be one of the function parameters.
LengthVar - Supplies the name of a USHORT local variable that will receive
the value of min(String->Length, 16).
Return Value:
None.
--*/
#define LoadSearchStringIntoXmmRegister_AlwaysMovsb(Slot, String, LengthVar) \
LengthVar = min(String->Length, 16); \
__movsb(Slot.Char, String->Buffer, LengthVar);

This basically allowed me to toggle which of the strategies I wanted to use to
load the search string into an XMM register. As you can see above, the
default is to use the AlwaysMovsb approach*; so, with version 4,
let's swap that out for the SEH approach, which wraps the aligned
load in a structured exception handler that falls back to __movsb()
if the aligned load fails and the pointer plus 16 bytes crosses a page boundary.
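The macros lean on two helpers that aren't shown here, PointerToOffsetCrossesPageBoundary and GetAddressAlignment. As a rough sketch of the underlying checks (not the actual definitions), something like the following would do. The function names and the 4096-byte page size are assumptions for illustration, and the alignment check returns a simple yes/no rather than an alignment value.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed page size; real code should query the operating system. */
#define SKETCH_PAGE_SIZE 4096u

/* True if reading `size` bytes starting at `ptr` would touch a second page. */
static bool crosses_page_boundary(const void *ptr, size_t size)
{
    uintptr_t offset_in_page = (uintptr_t)ptr & (SKETCH_PAGE_SIZE - 1);

    return offset_in_page + size > SKETCH_PAGE_SIZE;
}

/* True if `ptr` is a multiple of 16, i.e. usable with an aligned 128-bit load. */
static bool is_16_byte_aligned(const void *ptr)
{
    return ((uintptr_t)ptr & 15u) == 0;
}
```

The page check matters because a 16-byte load that starts near the end of a mapped page can fault if the next page isn't mapped, even when the string itself ends before the boundary.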

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_4(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This routine is a variant of version 3 that uses a structured exception
handler for loading the initial search string.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable->pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray->MinimumLength > String->Length) {
goto NoMatch;
}
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
LoadSearchStringIntoXmmRegister_SEH(Search, String, SearchLength);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their bytes will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 && Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}
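The slot-filtering stage that versions 3 through 6 share (shuffle, byte compare, signed length compare, movemask) can be modelled in plain scalar C, which is handy for checking what the SIMD instructions should produce. This is only a sketch: `sketch_table` and the three functions are hypothetical stand-ins rather than the real STRING_TABLE layout, unique indexes are assumed to be in the range 0..15, and the search length is assumed to be below 0x7f.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified table: one byte per field for each of 16 slots. */
typedef struct {
    uint8_t unique_chars[16];   /* Unique character chosen for each slot.       */
    uint8_t unique_index[16];   /* Byte offset of that character in the string. */
    int8_t  lengths[16];        /* Slot string length; 0x7f marks an empty slot. */
} sketch_table;

/* Marks every slot as empty. */
static void sketch_table_init(sketch_table *table)
{
    memset(table, 0, sizeof(*table));
    memset(table->lengths, 0x7f, sizeof(table->lengths));
}

/* Fills in one slot's unique character, its offset, and the string length. */
static void sketch_table_set(sketch_table *table, unsigned slot,
                             uint8_t unique_char, uint8_t unique_index,
                             int8_t length)
{
    table->unique_chars[slot] = unique_char;
    table->unique_index[slot] = unique_index;
    table->lengths[slot] = length;
}

/*
 * Returns a 16-bit bitmap with bit i set when slot i is still a candidate:
 * its unique character matches the search string at the slot's unique index,
 * and the slot's string is no longer than the search string.
 */
static uint32_t candidate_slots(const sketch_table *table,
                                const uint8_t search[16],
                                unsigned search_length)
{
    uint32_t bitmap = 0;
    int8_t length = (int8_t)search_length;  /* Assumed to be < 0x7f. */
    unsigned i;

    for (i = 0; i < 16; i++) {
        /* Shuffle: pick the search byte at this slot's unique index. */
        uint8_t c = search[table->unique_index[i] & 15];

        /* Byte compare: does it equal the slot's unique character? */
        int char_ok = (c == table->unique_chars[i]);

        /* Signed greater-than, inverted: slot must not be longer. */
        int length_ok = !(table->lengths[i] > length);

        if (char_ok && length_ok) {
            bitmap |= 1u << i;
        }
    }
    return bitmap;
}
```

Because empty slots carry a length of 0x7f, the signed comparison excludes them for any realistic search length without a separate "is this slot used" check.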

Performance of version 4 was slightly worse than 3 in every case:

Version 3 is still in the lead with the AlwaysMovsb-based search string
loading approach.

IsPrefixOfStringInTable_5

Version 5 is an interesting one. It's the first time we attempt to validate our
claim that it's more efficient to give the CPU a bunch of independent things to
do up-front, versus putting more branches in and attempting to terminate as
early as possible.

Note: we'll also explicitly use the LoadSearchStringIntoXmmRegister_AlwaysMovsb
macro here, instead of LoadSearchStringIntoXmmRegister, just to
make it more explicit that we're actually relying on the
__movsb()-based string loading routine.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_5(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This routine is a variant of version 3 that uses early exits (i.e.
returning NO_MATCH_FOUND as early as we can). It is designed to evaluate
the assertion we've been making that it's more optimal to give the CPU
a bunch of things to do up front versus doing something, then potentially
branching, doing the next thing, potentially branching, etc.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG CharBitmap;
ULONG LengthBitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable->pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray->MinimumLength > String->Length) {
goto NoMatch;
}
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
LoadSearchStringIntoXmmRegister_AlwaysMovsb(Search, String, SearchLength);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
CharBitmap = _mm_movemask_epi8(IncludeSlotsByUniqueChar);
if (!CharBitmap) {
return NO_MATCH_FOUND;
}
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their bytes will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
LengthBitmap = _mm_movemask_epi8(IncludeSlotsByLength);
if (!LengthBitmap) {
return NO_MATCH_FOUND;
}
Bitmap = CharBitmap & LengthBitmap;
if (!Bitmap) {
return NO_MATCH_FOUND;
}
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 && Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}

If our theory is correct, the performance of this version should be worse, due
to all the extra branches in the initial test. Let's see if we're right:

Holy smokes, version 5 is bad! It's so bad it's actually
closest in territory to the bunk version 2 that had the elaborate AVX2 prefix
matching routine. (Note: actually it was so close I ended up double-checking
the two routines were correct; they were, so this is just a
coincidence.)

Narrator: nice "double-checking", you putz.

That's good news though, as it validates this assumption that we've been working
with since inception:

//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//

That's the end of version 5's tenure. TL;DR: fewer branches > more branches.

Narrator: more accurate TL;DR: __movsb() is slow, and always
make sure you're testing what you think you're testing before drawing conclusions.

IsPrefixOfStringInTable_6

Version 6 is boring. We tweak the initial loading of the search string,
explicitly loading it via an unaligned load. If the underlying buffer is
aligned on a 16 byte boundary, this is just as fast as an aligned load.
If not, hey, at least it doesn't crash — it's just slow.

(If you attempt an aligned load on an address that isn't aligned at a 16 byte
boundary, the processor will generate an exception, resulting in the crash of
your program (assuming you don't have any structured exception handlers in place
to catch the error).)
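To make the unaligned load and the follow-up length mask a bit more concrete, here's a small sketch. It is not the routine below, just the two steps in isolation: `count_matched_chars` is a made-up name, the mask arithmetic stands in for `_bzhi_u32()`, the bit-clearing loop stands in for `__popcnt()`, and a scalar fallback is included for compilers that don't target SSE2. Both buffers are assumed to have at least 16 readable bytes.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#endif

/*
 * Compares the first 16 bytes at `search` (any alignment) against the 16
 * bytes at `slot`, ignores positions at or beyond min(length, 16), and
 * returns the number of matching bytes that remain.
 */
static unsigned count_matched_chars(const void *search, unsigned length,
                                    const void *slot)
{
    unsigned capped = length < 16 ? length : 16;
    unsigned count = 0;
    uint32_t mask;

#ifdef SKETCH_HAVE_SSE2
    /* Unaligned loads: no alignment requirement on either pointer. */
    __m128i s = _mm_loadu_si128((const __m128i *)search);
    __m128i t = _mm_loadu_si128((const __m128i *)slot);

    mask = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(s, t));
#else
    /* Scalar fallback that builds the same 16-bit equality mask. */
    const uint8_t *a = (const uint8_t *)search;
    const uint8_t *b = (const uint8_t *)slot;
    unsigned i;

    mask = 0;
    for (i = 0; i < 16; i++) {
        if (a[i] == b[i]) {
            mask |= 1u << i;
        }
    }
#endif

    /* Zero the bits at and above `capped` (what _bzhi_u32 would do). */
    mask &= (1u << capped) - 1u;

    /* Count the surviving bits (what __popcnt would do). */
    while (mask != 0) {
        mask &= mask - 1;
        count++;
    }
    return count;
}
```

The mask matters when the search string is shorter than the slot: zero padding after a 4-byte search string would otherwise "match" the zero padding in the slot and inflate the count.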

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_6(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This routine differs from version 3 in that we do an unaligned load of
the search string buffer without any SEH wrappers or alignment checks.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable->pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray->MinimumLength > String->Length) {
goto NoMatch;
}
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
SearchLength = min(String->Length, 16);
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their bytes will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 && Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}

Version 6 should be faster than version 3; we omit alignment checks, all of our
input buffers are aligned at 32 bytes, and an unaligned XMM load of an aligned
buffer should definitely be faster than a __movsb(). Let's see:

We have a new winner! Version 3 had a good run, but it's time to retire.
Let's tweak version 6 going forward.

Narrator: this is actually testing _mm_loadu_si128() against the AlignmentCheck
routine, which first calls PointerToOffsetCrossesPageBoundary(), and then checks
the address alignment before calling _mm_load_si128(). As unaligned loads are
just as fast as aligned loads as long as the underlying buffer is aligned, all
this is really showing is that it's slightly faster not doing the pointer
boundary check and address alignment check, which shouldn't be that surprising.
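
To make the cost being removed a bit more concrete, here's a minimal sketch of
the two checks the narrator mentions. The real
PointerToOffsetCrossesPageBoundary() isn't shown in this section, so the names
below (sketch_crosses_page, sketch_is_aligned) and the 4KB page size are
assumptions for illustration only, not the actual helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u /* assumed page size; hypothetical constant */

/* True if reading `size` bytes starting at `p` would spill into the next page. */
static bool sketch_crosses_page(const void *p, size_t size)
{
    assert(size > 0 && size <= SKETCH_PAGE_SIZE);
    uintptr_t offset_in_page = (uintptr_t)p & (SKETCH_PAGE_SIZE - 1);
    return offset_in_page + size > SKETCH_PAGE_SIZE;
}

/* True if `p` is aligned to `alignment`, which must be a power of two. */
static bool sketch_is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

Both checks are just a mask and a compare, but they sit in front of the load on
every call, which is consistent with version 6 coming out slightly ahead once
they're gone.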

IsPrefixOfStringInTable_7

Version 7 tweaks version 6 a little bit. We don't need the search string length
calculated so early in the routine. Let's move it to later.
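
For context, here's where the search length actually gets consumed. This is a
scalar sketch, not the routine itself: the helpers stand in for
_mm_cmpeq_epi8() plus _mm_movemask_epi8(), _bzhi_u32() and __popcnt(), and all
of the sketch_ names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for movemask(cmpeq(a, b)): bit i is set when a[i] == b[i]. */
static uint32_t sketch_equal_mask(const uint8_t a[16], const uint8_t b[16])
{
    uint32_t mask = 0;
    for (int i = 0; i < 16; i++) {
        if (a[i] == b[i]) {
            mask |= 1u << i;
        }
    }
    return mask;
}

/* Stand-in for _bzhi_u32(): clear every bit at position >= index. */
static uint32_t sketch_bzhi(uint32_t value, uint32_t index)
{
    return index >= 32 ? value : value & ((1u << index) - 1u);
}

/* Portable population count. */
static uint32_t sketch_popcount(uint32_t v)
{
    uint32_t count = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Count matching bytes within the first min(string_length, 16) positions. */
static uint32_t sketch_chars_matched(const uint8_t slot[16],
                                     const uint8_t search[16],
                                     uint32_t string_length)
{
    uint32_t search_length = string_length < 16 ? string_length : 16;
    uint32_t mask = sketch_bzhi(sketch_equal_mask(slot, search), search_length);
    uint32_t matched = sketch_popcount(mask);
    assert(matched <= search_length);
    return matched;
}
```

Nothing before the per-slot comparison touches the search length, which is why
its calculation can be pushed down past the initial bitmap check.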


% diff -u IsPrefixOfStringInTable_6.c IsPrefixOfStringInTable_7.c
--- IsPrefixOfStringInTable_6.c 2018-04-15 22:35:55.450273700 -0400
+++ IsPrefixOfStringInTable_7.c 2018-04-26 10:00:53.905933700 -0400
@@ -18,7 +18,7 @@
_Use_decl_annotations_
STRING_TABLE_INDEX
-IsPrefixOfStringInTable_6(
+IsPrefixOfStringInTable_7(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
@@ -31,9 +31,10 @@
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
- This routine differs from version 3 in that we do an aligned load of the
- search string buffer without any SEH wrappers or alignment checks. (Thus,
- this routine will fault if the buffer is unaligned.)
+ This routine is based off version 6, but alters when we calculate the
+ "search length" for the given string, which is done via the expression
+ 'min(String->Length, 16)'. We don't need this value until later in the
+ routine, when we're ready to start comparing strings.
Arguments:
@@ -125,7 +126,6 @@
// Load the first 16-bytes of the search string into an XMM register.
//
- SearchLength = min(String->Length, 16);
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
@@ -213,6 +213,13 @@
}
//
+ // Calculate the "search length" of the incoming string, which ensures we
+ // only compare up to the first 16 characters.
+ //
+
+ SearchLength = min(String->Length, 16);
+
+ //
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_7(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This routine is based off version 6, but alters when we calculate the
"search length" for the given string, which is done via the expression
'min(String->Length, 16)'. We don't need this value until later in the
routine, when we're ready to start comparing strings.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
PSTRING_ARRAY StringArray;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
StringArray = StringTable->pStringArray;
//
// If the minimum length of the string array is greater than the length of
// our search string, there can't be a prefix match.
//
if (StringArray->MinimumLength > String->Length) {
goto NoMatch;
}
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their bytes will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//
SearchLength = min(String->Length, 16);
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 && Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}

This is a tiny change; if it shows any performance difference, it should err on
the side of a positive change, although perhaps the compiler noticed that we
didn't use the expression until much later and deferred the scheduling until
after the initial negative match logic. Let's see:

Tiny change, tiny performance improvement! Looks like this saves a couple of
cycles, thus ending the short-lived reign of version 6.

IsPrefixOfStringInTable_8

Version 8 is based off version 7, but omits the initial minimum length test of
the string array.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_8(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This routine is based off version 7, but omits the initial minimum
length test of the string array.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their bytes will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
goto NoMatch;
}
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//
SearchLength = min(String->Length, 16);
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 && Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
NoMatch:
//IACA_VC_END();
return NO_MATCH_FOUND;
}

Hey, look at that, another win across the board! Omitting the length test
shaves off a few more cycles for both prefix and negative matching. Version 7's
one-round reign has come to a timely end.
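
It's worth spelling out why dropping the test doesn't change the result. The
SIMD length comparison already excludes every slot longer than the search
string, so if the table's minimum length exceeds the search string's length,
the length mask (and therefore the final bitmap) is zero anyway. Here's a
scalar sketch of that reasoning; the sketch_ names are hypothetical, and it
assumes lengths below 128 so the signed byte compare behaves like the one in
the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Length stored for unused slots, per the comments in the listing. */
#define SKETCH_EMPTY_SLOT_LENGTH 0x7f

/*
 * Stand-in for cmpgt_epi8 + xor with all-ones + movemask: bit i is set when
 * slot i's length is less than or equal to the search string's length.
 */
static uint32_t sketch_include_by_length(const int8_t slot_lengths[16],
                                         int8_t search_length)
{
    uint32_t include = 0;
    assert(search_length >= 0);
    for (int i = 0; i < 16; i++) {
        int ignore = slot_lengths[i] > search_length; /* signed byte compare */
        if (!ignore) {
            include |= 1u << i; /* inverted mask: keep this slot */
        }
    }
    return include;
}

/* Smallest length in the table; what the removed up-front test relied on. */
static int8_t sketch_minimum_length(const int8_t slot_lengths[16])
{
    int8_t minimum = slot_lengths[0];
    for (int i = 1; i < 16; i++) {
        if (slot_lengths[i] < minimum) {
            minimum = slot_lengths[i];
        }
    }
    return minimum;
}
```

Because the final bitmap is the AND of this mask with the unique character
mask, an empty length mask guarantees the existing bitmap check bails out.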

IsPrefixOfStringInTable_9

Version 9 tweaks version 8 and simply does return NO_MATCH_FOUND
after the initial bitmap check versus goto NoMatch. (The use of goto was a
bit peculiar there, anyway. And we're going to rewrite the body in a similar
fashion for version 10, but let's try to stick to making one change at a time.)
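
As a reminder of what sits behind that initial bitmap check, here's a scalar
sketch of how the routine walks the candidate slots: count the trailing zeros,
add the running shift to get the slot index, then shift the bitmap past the bit
just consumed. The sketch_ helpers are portable stand-ins for _tzcnt_u32() and
__popcnt(), and the function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Portable trailing-zero count; returns 32 for zero, like _tzcnt_u32(). */
static uint32_t sketch_tzcnt(uint32_t v)
{
    uint32_t zeros = 0;
    if (v == 0) {
        return 32;
    }
    while ((v & 1u) == 0) {
        v >>= 1;
        zeros++;
    }
    return zeros;
}

/* Portable population count. */
static uint32_t sketch_count_bits(uint32_t v)
{
    uint32_t count = 0;
    while (v) {
        v &= v - 1; /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Writes the slot index of every set bit into `out`; returns how many. */
static uint32_t sketch_candidate_slots(uint32_t bitmap, uint32_t out[16])
{
    uint32_t shift = 0;
    uint32_t written = 0;
    uint32_t count;

    assert(bitmap <= 0xffffu); /* a 16-slot table yields a 16-bit mask */
    if (bitmap == 0) {
        return 0; /* early return, in the spirit of version 9 */
    }
    count = sketch_count_bits(bitmap);
    do {
        uint32_t zeros = sketch_tzcnt(bitmap);
        uint32_t index = zeros + shift;
        bitmap >>= (zeros + 1); /* step past the zeros and the 1 just found */
        shift = index + 1;
        out[written++] = index;
    } while (--count);
    return written;
}
```

With a bitmap of 0x8005, this visits slots 0, 2 and 15, in that order.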

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_9(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This is a tweaked version of version 8 that does 'return NO_MATCH_FOUND'
after the initial bitmap check versus 'goto NoMatch'.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their bytes will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
return NO_MATCH_FOUND;
}
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//
SearchLength = min(String->Length, 16);
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched == 16 && Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
} else {
//
// We successfully prefix matched the search string against
// this slot. The code immediately following us deals with
// handling a successful prefix match at the initial slot
// level; let's avoid an unnecessary branch and just jump
// directly into it.
//
goto FoundMatch;
}
}
if ((USHORT)CharactersMatched == Length) {
FoundMatch:
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// Not enough characters matched, so continue the loop.
//
} while (--Count);
//
// If we get here, we didn't find a match.
//
//IACA_VC_END();
return NO_MATCH_FOUND;
}

This is an interesting one. The return versus goto looks to have cost us a
little bit with the first few test inputs. But only a tiny amount, we're
talking about like 0.2 more cycles, which is nothing in the grand scheme of
things. (Although let's not pull on that thread too much, the entire premise
of the whole article will quickly unravel!)

Version 9 improves the negative match performance by a few cycles, though, so
let's keep it.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_10(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This version is based off version 9, but rewrites the inner loop that
checks for comparisons.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
//
// Unconditionally do the following five operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register, and load
// the string table's slot lengths array into a second XMM register.
//
// 4. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 5. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 3. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
return NO_MATCH_FOUND;
}
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//
SearchLength = min(String->Length, 16);
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched < Length && Length <= 16) {
//
// The slot length is longer than the number of characters matched
// from the search string; this isn't a prefix match. Continue.
//
continue;
}
if (Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
}
}
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
} while (--Count);
//
// If we get here, we didn't find a match.
//
//IACA_VC_END();
return NO_MATCH_FOUND;
}

That's a nicer bit of logic. More C-like, less assembly-like. Arguably
clearer. Let's see how they compare. (This is an interesting one as I
genuinely don't have a strong hunch what sort of a performance impact this
will have; obviously I thought the first way of structuring the loop was
optimal, and I had that in place for two years before deciding to embark
on this article, which led to the rework we just saw.)

Hey, look at that, we've shaved off a few more cycles in most cases, especially
for the negative matches!
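
To make the loop's bit manipulation a little more concrete, here's a minimal, portable sketch of the two tricks it leans on: walking the candidate bitmap with a trailing-zero count plus a shift, and counting matched characters by zeroing the high bits of a comparison mask before a popcount. The helper names (trailing_zeros32, popcount32, zero_high_bits32, collect_slot_indices, characters_matched) are made up for illustration and merely stand in for _tzcnt_u32, __popcnt and _bzhi_u32; treat it as a model of the technique, not the routine's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32: index of the lowest set bit (32 if none). */
static unsigned trailing_zeros32(uint32_t value)
{
    unsigned count = 0;

    if (value == 0) {
        return 32;
    }
    while ((value & 1u) == 0) {
        value >>= 1;
        count++;
    }
    return count;
}

/* Portable stand-in for __popcnt: number of set bits. */
static unsigned popcount32(uint32_t value)
{
    unsigned count = 0;

    while (value != 0) {
        value &= value - 1;     /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Portable stand-in for _bzhi_u32: zero every bit at position >= index. */
static uint32_t zero_high_bits32(uint32_t value, unsigned index)
{
    return (index >= 32) ? value : (value & ((1u << index) - 1u));
}

/*
 * Walk the candidate bitmap the same way the do/while loop does, writing each
 * slot index to `indices` (room for 16 entries). Returns how many were written.
 */
static unsigned collect_slot_indices(uint32_t bitmap, unsigned indices[16])
{
    unsigned count = popcount32(bitmap);
    unsigned shift = 0;
    unsigned written = 0;

    assert(bitmap <= 0xffffu);  /* 16 slots, so at most 16 mask bits */

    while (count--) {
        unsigned zeros = trailing_zeros32(bitmap);
        unsigned index = zeros + shift;

        bitmap >>= (zeros + 1); /* skip the zeros and the 1 just found */
        shift = index + 1;
        indices[written++] = index;
    }
    return written;
}

/*
 * Count the byte positions below search_length whose comparison bit is set.
 * This only equals "length of the matching prefix" under the assumption that
 * slots are zero padded and the search string has no embedded NUL bytes.
 */
static unsigned characters_matched(uint32_t compare_mask, unsigned search_length)
{
    return popcount32(zero_high_bits32(compare_mask, search_length));
}
```

For example, a bitmap of 0x8005 yields slot indices 0, 2 and 15, and a comparison mask of 0x001b with a search length of 4 counts 3 matching characters.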

Speeding Up Negative Matches with Assembly

(Note: if you build the Tracer project, you can run cdb-simple.bat, a helper
batch file in the root directory. It uses cdb to launch ModuleLoader.exe, one of
the project's executables, which starts up, loads all of our tracing project's
DLLs, then allows the debugger to break in. That yields a debugger prompt from
which we can easily disassemble functions, inspect runtime function entries,
and so on. This is the approach I used for capturing the output over the next
couple of sections.)
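
As a refresher on what the negative match logic actually computes, here's a scalar model of it in plain C, one slot at a time instead of sixteen at once. Everything named Toy or toy_ is hypothetical: ToyTable is a cut-down stand-in for the first three rows of STRING_TABLE, and the sketch zero pads the search string rather than doing a raw 16 byte load. It's a sketch of the idea under those assumptions, not the project's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS 16
#define TOY_EMPTY_LENGTH 0x7f   /* empty slots get the largest signed byte */

/* Hypothetical, cut-down stand-in for the first three rows of STRING_TABLE. */
typedef struct {
    uint8_t unique_chars[TOY_SLOTS];    /* one distinguishing character per slot */
    uint8_t unique_index[TOY_SLOTS];    /* byte offset of that character */
    uint8_t lengths[TOY_SLOTS];         /* slot string lengths */
} ToyTable;

static void toy_table_init(ToyTable *table)
{
    memset(table, 0, sizeof(*table));
    memset(table->lengths, TOY_EMPTY_LENGTH, sizeof(table->lengths));
}

static void toy_table_set(ToyTable *table, unsigned slot, uint8_t unique_char,
                          uint8_t unique_index, uint8_t length)
{
    assert(slot < TOY_SLOTS);
    assert(unique_index < TOY_SLOTS && length < TOY_EMPTY_LENGTH);
    table->unique_chars[slot] = unique_char;
    table->unique_index[slot] = unique_index;
    table->lengths[slot] = length;
}

/*
 * Returns a 16 bit mask with bit N set when slot N survives both filters.
 * A result of zero is the fast negative match.
 */
static uint32_t toy_candidate_bitmap(const ToyTable *table, const char *search,
                                     size_t search_length)
{
    uint8_t first16[TOY_SLOTS] = { 0 };
    size_t copy = (search_length < TOY_SLOTS) ? search_length : TOY_SLOTS;
    uint32_t bitmap = 0;
    unsigned slot;

    /* Keep lengths in signed byte range, as the signed compare requires. */
    assert(search_length < TOY_EMPTY_LENGTH);
    memcpy(first16, search, copy);

    for (slot = 0; slot < TOY_SLOTS; slot++) {
        /* Shuffle + compare-equal: pick the search byte at the slot's
         * unique index and compare it to the slot's unique character. */
        uint8_t probe = first16[table->unique_index[slot]];
        int char_matches = (probe == table->unique_chars[slot]);

        /* Signed compare-greater-than: slots longer than the search string
         * (and empty slots, at 0x7f) can never be a prefix of it. */
        int slot_too_long =
            ((int8_t)table->lengths[slot] > (int8_t)search_length);

        if (char_matches && !slot_too_long) {
            bitmap |= (1u << slot);
        }
    }
    return bitmap;
}
```

With slot 0 set up as a 4 character string whose unique character is 'M' at offset 1, a search for "$MftMirr" leaves bit 0 set, whereas "$Mf" (too short) and "hello" (wrong character) both produce an empty bitmap. Note how the 0x7f length keeps empty slots out even when their zeroed unique character happens to compare equal.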

Now for the fun part! Let's take a look at the disassembly of the initial part
of version 10 that is responsible for doing the negative match logic and see if
there are any improvements we can make.

There's a bit of cruft there at the start with regards to setting up the
function's prologue (pushing non-volatile registers to the stack, etc). That's
to be expected for C (and C++, and basically every language); as the programmer,
you don't really have any direct control over how many registers
a compiler uses for a routine, how much stack space it uses, which registers it
uses when, etc.

However, with assembly, we're on the other end of the spectrum: we can control
everything! We also have a little trick up our sleeves: the venerable
LEAF_ENTRY.

First, some background. The Windows x64 ABI and calling convention dictate that
there are two types of functions: NESTED_ENTRY and LEAF_ENTRY. NESTED_ENTRY is
by far the most common; C and C++ functions are all implicitly NESTED_ENTRY
functions. (The LEAF_ENTRY and NESTED_ENTRY symbols are MASM (ml64.exe) macro
names, but the concept applies to all languages.)

A LEAF_ENTRY can only be implemented in assembly. It is constrained in that it
may not manipulate any of the non-volatile x64 registers (rbx, rdi, rsi, rsp,
rbp, r12, r13, r14, r15, xmm6-15), nor may it call any other functions
(because call implicitly modifies the stack pointer), nor may it have a structured
exception handler (because handling an exception for a given stack frame also
manipulates the stack pointer).

The reason behind all of these constraints is that LEAF_ENTRY routines do not
have any unwind information generated for them in their runtime function
entries. Unwind information is used by the kernel to do just that, unwind
the modifications made to non-volatile registers whilst traversing back up
through the call stack looking for an exception handler in the case of an
exception.
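
To illustrate what that unwinding amounts to, here's a deliberately tiny model in C. It is not how the real unwinder is implemented (that works from the unwind codes recorded for each function and is far more involved); ToyContext, ToyFunctionEntry and toy_unwind_one_frame are invented names, the "stack" is just an array, and only a single non-volatile register is tracked. The point is simply to contrast the two cases: with unwind information there are prologue effects to undo first, and without it the return address is already sitting at the top of the stack.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register context; rsp is a byte offset into a toy stack array. */
typedef struct {
    uint64_t rsp;
    uint64_t rip;
    uint64_t rbx;   /* a single non-volatile register keeps the model small */
} ToyContext;

/* Hypothetical summary of what a function's prologue did. */
typedef struct {
    int      has_unwind_info;   /* 0 models a leaf function */
    int      pushed_rbx;        /* prologue executed `push rbx` */
    uint64_t stack_alloc;       /* bytes subtracted from rsp after the push */
} ToyFunctionEntry;

static uint64_t toy_read64(const uint64_t *stack, uint64_t byte_offset)
{
    assert(byte_offset % 8 == 0);
    return stack[byte_offset / 8];
}

/* Step the context from the current function back into its caller. */
static void toy_unwind_one_frame(ToyContext *context,
                                 const ToyFunctionEntry *entry,
                                 const uint64_t *stack)
{
    if (entry->has_unwind_info) {
        /* Undo the prologue in reverse: release the stack allocation,
         * then restore any pushed non-volatile register. */
        context->rsp += entry->stack_alloc;
        if (entry->pushed_rbx) {
            context->rbx = toy_read64(stack, context->rsp);
            context->rsp += 8;
        }
    }

    /* In both cases rsp now addresses the return address: pop it. */
    context->rip = toy_read64(stack, context->rsp);
    context->rsp += 8;
}
```

For the leaf case the whole step collapses to "read the return address at rsp, then add 8 to rsp", which is why no unwind information is needed as long as the function leaves rsp and the non-volatile registers alone.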

For example, here's the function entry and associated unwind information for the
PGO build of the IsPrefixOfStringInTable_10 function:

We can see that this routine manipulates 6 non-volatile registers in total,
including the stack pointer. The first instructions of the routine constitute
the function's prologue; in the disassembly, you can see that three of the rxx
registers are pushed to the stack and then 0x20 (32) bytes of stack space is
allocated:

It also cheekily uses the home parameter space for stashing rbp and rsi, instead
of pushing them to the stack. That's fair game though, this is the PGO build,
I'd expect it to use some extra tricks for shaving off a few more cycles here
and there. I'd do the same thing if I were writing assembly. (Side note: if
you view the source of this page, there's a commented-out section below that
depicts the runtime function entry for the release build of version 10; it uses
9 registers instead of 6, and 40 bytes of stack space instead of 32. I wrote it
before switching to the PGO build for everything.)

The home parameter space is a 32 byte area that immediately follows the return
address (i.e. the value of rsp when the function is entered); it is
mandated by the x64 calling convention on Windows, and is primarily intended to
provide some scratch space for a routine to home its parameter
registers (i.e. the registers used for the first four arguments of a function:
rcx, rdx, r8 and r9). This allows the four volatile registers to be repurposed
within a routine, but also have a way to refer to the parameters again if need
be. At least, that's what it was intended for; however, it's not
something that is enforced, so you can basically treat the area as a free 32 byte
scratch area if you're writing assembly.
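
One way to picture the layout at the moment a function is entered is as a C struct overlaid on the stack at rsp. The struct and field names below (EntryFrame, HomeRcx and so on) and the home_slot_offset helper are purely illustrative; only the offsets come from the calling convention described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* What the stack looks like at rsp on entry to a Windows x64 function. */
typedef struct {
    uint64_t ReturnAddress;     /* [rsp + 0x00] pushed by the call */
    uint64_t HomeRcx;           /* [rsp + 0x08] home slot for parameter 1 */
    uint64_t HomeRdx;           /* [rsp + 0x10] home slot for parameter 2 */
    uint64_t HomeR8;            /* [rsp + 0x18] home slot for parameter 3 */
    uint64_t HomeR9;            /* [rsp + 0x20] home slot for parameter 4 */
} EntryFrame;

/* Byte offset from rsp (at entry) of the home slot for parameter 0..3. */
static size_t home_slot_offset(unsigned parameter_index)
{
    assert(parameter_index < 4);
    return offsetof(EntryFrame, HomeRcx) + parameter_index * sizeof(uint64_t);
}
```

The four home slots span 0x08 through 0x27 relative to rsp at entry, which is the 32 byte area the PGO build borrows for stashing rbp and rsi.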

(On a semi-related note, I'd highly recommend reading
A History of Modern 64-bit Computing if you have some spare time, it's a
fascinating insight into contemporary x64 conventions we often take for granted,
drawing on numerous interviews with industry luminaries like Dave Cutler and Linus
Torvalds. I found it incredibly useful for understanding the why
behind things like home parameter space, structured exception handling, runtime
function entries, and why you can't write inline assembly for x64 with MSVC
anymore — it provides a direct vector for obliterating the mechanisms
relied upon by the kernel stack unwinding functionality. (At least, I think
that's the reason — can anyone from Microsoft confirm?))

Note how we don't need to push anything to the stack as we didn't manipulate any
non-volatile registers. If an exception occurs within the body of our
implementation (say we dereference a NULL pointer), the kernel knows it doesn't
have to undo any non-volatile register modifications (using offsets specified by
the unwind information) because there isn't any unwind information. It can
simply advance to the frame before us (the return address sits at rsp at the
time of the fault, so the caller's frame resumes 8 bytes above that) as it
continues its search for runtime function entries and associated unwind
information. As you can see, the unwind info is effectively empty:

Let's see how this scrappy little fellow (who always returns NO_MATCH_FOUND but
still mimics the steps required to successfully negative match) does against the
leading C implementation at this point, version 10:

Fwoah, look at that, we've shaved about three cycles off the C version!

(Note that when I first wrote this, I was comparing the assembly version against
the release build (not the PGO build), which was clocking in at about 13-14
cycles for negative matching. So getting it down to ~7.5 from 13-14 was a bit
more exciting. Damn the PGO build and its 10.9-ish cycles for negative
matching!)

The good news is that our theory about the performance of the LEAF_ENTRY looks
like it's paid off: we can reliably get about 7.5 cycles for negative matching.

IsPrefixOfStringInTable_x64_2

The bad news is that we now need to implement the rest of the functionality
within the constraints of a LEAF_ENTRY!

The problem with a LEAF_ENTRY for anything more than a trivial bit of code is
that you only have a handful of volatile registers to work with, and no stack
space can be used for register spilling or temporaries. (Technically I could
use the home parameter space, but, eh, we're already avoiding stack spills, why
not make life harder for ourselves and try to avoid all memory
spilling.)

If you can't spill to memory, your only option is really spilling to XMM
registers via vpinsr and vpextr combinations, which,
as you can see in the implementation of version 2 below, I have to do a lot.

(Also note: when I wrote this version, I didn't use the disassembly from the C
routines for guidance. I find that as soon as you start to grok the disassembly
for a given routine, it becomes harder to think of ways to approach it from a
fresh angle. Also, the LEAF_ENTRY aspect significantly limited what I could do
anyway, so I figured I may as well just give it a crack from scratch and see
what I could come up with. It would be an interesting point of reference
compared to a future iteration that tries to improve on the disassembly of an
optimized PGO version, for example.)

The diff view for this routine is less useful given the vast majority of the
code is new, so I've put the full version of the code first. It's based more or
less on the approach used by version 8 of the C routine (I actually wrote it
after I wrote version 8; versions 9 and 10 of the C routine (with the latter
having the improved loop logic) came after).

Looking back on my time logs (shout out to my favorite iPhone app,
HoursTracker!),
the routine above took about 8 hours to implement over the course of about two
days, give or take. Writing assembly is slow, writing correct assembly is even
slower. I generally find that there's a noticeable hump I need to get over in
the first say 30 minutes of any assembly programming session, but once you get
into the zone, things can start flowing quite nicely. I'm an aggressive
debugger user; often, to get started I'll write a simple LEAF_ENTRY
that looks like this:

That'll allow me to attach the debugger and at least inspect the parameter
registers so I can write the next couple of instructions. I find it definitely
helps get me into the zone quicker.

Anyway, enough about that. Let's look at performance. Again, this will be an
interesting one — other than the optimal negative match logic that I
copied from version 1, the sole focus was on getting a working assembly version;
I wasn't giving any thought to performance at this stage.

So, it'll be interesting to see how it compares to a) version 1 in the negative
matching case (it should be very close), and b) against the C versions in the
prefix matching case (it hopefully won't be prohibitively worse).

Hmmm, that's not too bad! We're very close to version 1 for negative matching,
within about 0.5 cycles or so. That sounds about right, given that our initial
logic had to be tweaked a bit to play nicer with the rest of the implementation.
And we're still about 3-4 cycles faster than the fastest C version.

What about prefix matching performance?

The prefix matching performance isn't too bad either! We're definitely slower
than the C version, by about 4 to 10 cycles in most cases,
with the $INDEX_ALLOCATION input about 13 cycles slower.

(I've just noticed the pattern with regards to the first 8 entries, $AttrDef to
$Mft, clocking in at about 18 and 24 cycles respectively. But the next four
entries, $Secure to $Cairo, consistently clock in at about 24 and 34 cycles
respectively. $Secure is the 9th slot, which puts it at memory offset 192 bytes
from the start of the string table. And then the 18 and 24 cycle behavior
returns for the last two items, ???? and ., which are
at the end of the string table's inner slot array. This pattern is prevalent in
all of our iterations. Very peculiar! We'll investigate later.)

IsPrefixOfStringInTable_x64_3

(We're nearly at the end of the first round of iterations, I promise!)

Seeing the performance of the second version in assembly, I decided to try to whip
up a third version, which would switch from a LEAF_ENTRY to NESTED_ENTRY, and
use rep cmps for the byte comparison for long strings (instead of
the byte-by-byte approach used now).

In order to use rep cmps, you need to use two non-volatile
registers, rsi (the source index) and rdi (the
destination index). You also need to specify the direction of the comparison,
which means mutating the flags, which are also classed as non-volatile, so they
need to be pushed to the stack in the prologue and popped back off in the
epilogue.
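
For anyone who hasn't used the string instructions before, here's a rough C model of what a forward (direction flag clear) repe cmpsb does with those registers. The function and struct names (toy_repe_cmpsb, ToyCmpsResult) are made up, and the model glosses over one detail: when the count is zero on entry, the real instruction leaves the flags untouched, whereas this sketch simply reports "equal".

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t remaining;   /* models rcx after the instruction completes */
    int    equal;       /* models ZF: 1 if the last compared pair matched */
} ToyCmpsResult;

/*
 * Compare bytes at rsi and rdi, advancing both and decrementing the count,
 * until either the count hits zero or a pair differs.
 */
static ToyCmpsResult toy_repe_cmpsb(const unsigned char *rsi,
                                    const unsigned char *rdi,
                                    size_t rcx)
{
    ToyCmpsResult result = { rcx, 1 };

    assert(rsi != NULL && rdi != NULL);

    while (result.remaining != 0) {
        result.equal = (*rsi++ == *rdi++);
        result.remaining--;
        if (!result.equal) {
            break;      /* repe stops as soon as a comparison fails */
        }
    }
    return result;
}
```

Comparing "abcdef" against "abcxef" for 6 bytes stops after the fourth pair, leaving 2 in the count with the equal flag cleared; comparing identical buffers runs the count down to zero with the flag still set.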

I didn't really expect this to offer a measurable speedup, but it was a tangible
reason to use a NESTED_ENTRY, and otherwise allowed me to stay within the
confines of the version 2 implementation.

Let's take a look at the implementation. At the very least, it's useful to see
how you can go about organizing your prologue in MASM. For
NESTED_ENTRY routines, I always define a Locals
structure that incorporates the return address and home parameter space for easy
access. Mainly because it allows me to write code like this:

This routine was written last, after version 10 of the C routine, so it
incorporates the slightly re-arranged loop logic that proved to be faster for
that version. Other than that, the main changes involved converting all the
early exit returns in the body of the function to jump to a single exit point,
Pfx90, mainly to simplify epilogue exit code.

I don't have a strong hunch as to how this will perform; like I said earlier, it
was mainly done to set up the scaffolding for using a NESTED_ENTRY in the
future, such that we'll have the glue in place if we want to iterate on the
disassembly of the PGO versions. If I had to guess, I suspect it will be
slightly slower than version 2, but surely not by much, right? It's a pretty
minor change in the grand scheme of things. Let's take a look.

Hah! Version 3 is much, much worse! Even its negative matching performance is
terrible, which is the one thing the assembly versions have been good at so far.
How peculiar.

Now, in the interest of keeping events chronological, as much as I'd like to
dive in now and figure out why, I'll have to defer to my behavior when I
encountered this performance gap: I laughed, shelved the version 3 experiment,
and moved on.

That's a decidedly unsatisfying end to the matter, though, I'll admit. We'll
come back to it later in the article and try and get some closure as to why it
was so slow, comparatively.

Round 2 — Post-Internet Feedback

IsPrefixOfStringInTable_11

Both Fabian Giesen and
Wojciech Muła pointed out that
we could use _mm_andnot_si128() to avoid the need to invert the
results of the IncludeSlotsByLength XMM register (via
_mm_xor_si128()). Let's try that.
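
As a quick sanity check that the two forms really are interchangeable, here's a per-byte model in portable C. The function names are illustrative only. The one thing worth keeping in mind is the operand order: _mm_andnot_si128(a, b) computes (~a) & b, so the mask being inverted goes first.

```c
#include <assert.h>
#include <stdint.h>

/* Version 10's two-step form: invert with XOR against all-ones, then AND. */
static uint8_t include_via_xor_and(uint8_t ignore_by_length,
                                   uint8_t include_by_char)
{
    uint8_t include_by_length = (uint8_t)(ignore_by_length ^ 0xffu);

    return (uint8_t)(include_by_char & include_by_length);
}

/* The one-step form: (~a) & b, which is what "and not" computes per bit. */
static uint8_t include_via_andnot(uint8_t ignore_by_length,
                                  uint8_t include_by_char)
{
    return (uint8_t)(~ignore_by_length & include_by_char);
}

/* Exhaustively compare the two forms over every pair of byte values. */
static int forms_agree_for_all_bytes(void)
{
    unsigned a;
    unsigned b;

    for (a = 0; a < 256; a++) {
        for (b = 0; b < 256; b++) {
            if (include_via_xor_and((uint8_t)a, (uint8_t)b) !=
                include_via_andnot((uint8_t)a, (uint8_t)b)) {
                return 0;
            }
        }
    }
    return 1;
}
```

Since both operations are purely bitwise, agreement on every byte pair carries over to the full 128 bit registers.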

% diff -u IsPrefixOfStringInTable_10.c IsPrefixOfStringInTable_11.c
--- IsPrefixOfStringInTable_10.c 2018-04-26 10:38:09.357890400 -0400
+++ IsPrefixOfStringInTable_11.c 2018-04-26 12:43:44.184528000 -0400
@@ -18,7 +18,7 @@
_Use_decl_annotations_
STRING_TABLE_INDEX
-IsPrefixOfStringInTable_10(
+IsPrefixOfStringInTable_11(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
@@ -31,8 +31,8 @@
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
- This version is based off version 8, but rewrites the inner loop that
- checks for comparisons.
+ This version is based off version 10, but with the vpandn used at the
+ end of the initial test, as suggested by Wojciech Mula (@pshufb).
Arguments:
@@ -70,9 +70,7 @@
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
- XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
- const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
//
// Unconditionally do the following five operations before checking any of
@@ -158,28 +156,25 @@
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
- // we invert the mask shortly after.
+ // we do the "and not" intersection with the include slots next.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
- // Invert the result of the comparison; we want 0xff for slots to include
- // and 0x0 for slots to ignore (it's currently the other way around). We
- // can achieve this by XOR'ing the result against our all-ones XMM register.
- //
-
- IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
-
- //
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
+ // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+ // at the moment (we want 0xff for slots to include, and 0x00 for slots
+ // to ignore; it's currently the other way around), we use _mm_andnot_si128
+ // instead of just _mm_and_si128.
+ //
- IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
- IncludeSlotsByLength);
+ IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+ IncludeSlotsByUniqueChar);
//
// Generate a mask.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_11(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This version is based off version 10, but with the vpandn used at the
end of the initial test, as suggested by Wojciech Mula (@pshufb).
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlots;
//
// Unconditionally do the following five operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register, and load
// the string table's slot lengths array into a second XMM register.
//
// 4. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 5. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 3. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we do the "and not" intersection with the include slots next.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
// As the IgnoreSlotsByLength XMM register is the inverse of what we want
// at the moment (we want 0xff for slots to include, and 0x00 for slots
// to ignore; it's currently the other way around), we use _mm_andnot_si128
// instead of just _mm_and_si128.
//
IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
IncludeSlotsByUniqueChar);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
return NO_MATCH_FOUND;
}
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//
SearchLength = min(String->Length, 16);
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched < Length && Length <= 16) {
//
// The slot length is longer than the number of characters matched
// from the search string; this isn't a prefix match. Continue.
//
continue;
}
if (Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
}
}
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
} while (--Count);
//
// If we get here, we didn't find a match.
//
//IACA_VC_END();
return NO_MATCH_FOUND;
}

We're only shaving one instruction off here, so the performance gain, if any,
should be very modest.
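
To make the saved instruction concrete, here's a minimal scalar sketch (not the
actual routine) that models the two masks as plain 64-bit integers instead of
XMM registers. The function names are made up for illustration; the only point
is that an XOR against all-ones followed by an AND gives the same result as a
single and-not.

```c
#include <stdint.h>

/* Version 10 style: invert the "ignore" mask with XOR, then AND. */
uint64_t include_via_xor_and(uint64_t include_by_char, uint64_t ignore_by_len)
{
    uint64_t include_by_len = ignore_by_len ^ UINT64_MAX;   /* like vpxor  */
    return include_by_char & include_by_len;                /* like vpand  */
}

/* Version 11 style: one and-not does the inversion and the AND together. */
uint64_t include_via_andnot(uint64_t include_by_char, uint64_t ignore_by_len)
{
    return ~ignore_by_len & include_by_char;                /* like vpandn */
}
```

Both functions return the same mask for any pair of inputs, which is why the
change can only ever save the one inverting instruction.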

Definitely a slight improvement over version 10 in most cases!

IsPrefixOfStringInTable_x64_4

Something I didn't know about vptest that Fabian pointed out is
that it actually does two operations. The first essentially does an AND of the
two input registers and sets the zero flag (ZF=1) if the result is all 0s.
We've been using that aspect in the assembly version up to now.

However, it also does the equivalent of ((not first operand) and second
operand), and sets the carry flag (CF=1) if that expression evaluates to all
zeros. That's handy, because it's exactly the expression we want to do!
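
Here's a small scalar model of those two flag computations, assuming we
represent a 128-bit register as a pair of 64-bit halves. The u128_model type
and the ptest_zf/ptest_cf helpers are hypothetical names for this sketch; they
aren't part of the string table code.

```c
#include <stdbool.h>
#include <stdint.h>

/* A 128-bit value modeled as two 64-bit halves (illustrative type only). */
typedef struct { uint64_t lo, hi; } u128_model;

/* ZF: set when (a AND b) is all zeros. */
bool ptest_zf(u128_model a, u128_model b)
{
    return ((a.lo & b.lo) | (a.hi & b.hi)) == 0;
}

/* CF: set when ((NOT a) AND b) is all zeros. */
bool ptest_cf(u128_model a, u128_model b)
{
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}
```

With the slots-to-ignore mask as the first operand and the slots-to-include
mask as the second, CF=1 means no slot survived the intersection, which is the
fast-path exit condition.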

So, let's take version 2 of our assembly routine, remove the vpxor bit, and
re-arrange the vptest inputs such that we can do a jnc instead of
jnz:

Let's see how that stacks up against the existing version 2 of the assembly
routine:

Nice, we've shaved an entire cycle off the negative match path! I say that both
seriously and sarcastically. A single cycle, wow, stop the presses! On the other
hand, going from 8 cycles to 7 cycles is usually a lot harder than, say, going
from 100,000 cycles to 80,000 cycles. We're so close to the lower bound that
additional cycle improvements are a lot like trying to get blood out of a stone.

IsPrefixOfStringInTable_12

The vptest fast-path exit definitely yielded a repeatable and measurable gain
for the assembly version. Let's replicate it in a C version.

Diff

Full

% diff -u IsPrefixOfStringInTable_10.c IsPrefixOfStringInTable_12.c
--- IsPrefixOfStringInTable_10.c 2018-04-26 13:28:06.006627100 -0400
+++ IsPrefixOfStringInTable_12.c 2018-04-26 17:47:54.970331600 -0400
@@ -19,7 +19,7 @@
_Use_decl_annotations_
STRING_TABLE_INDEX
-IsPrefixOfStringInTable_10(
+IsPrefixOfStringInTable_12(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
@@ -32,8 +32,15 @@
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
- This version is based off version 8, but rewrites the inner loop that
- checks for comparisons.
+ This version is based off version 10, but factors in the improvements
+ made to version 4 of the x64 assembly version, thanks to suggestions from
+ both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).
+
+ Like version 11, we omit the vpxor to invert the lengths, but instead of
+ an initial vpandn, we leverage the fact that vptest sets the carry flag
+ if all 0s result from the expression: "(not param1) and param2". This
+ allows us to do a fast-path early exit (like x64 version 2 does) if no
+ match is found.
Arguments:
@@ -71,9 +78,7 @@
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
- XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
- const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
//
// Unconditionally do the following five operations before checking any of
@@ -159,47 +164,58 @@
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
- // we invert the mask shortly after.
+ // we do the "and not" intersection with the include slots next.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
- // Invert the result of the comparison; we want 0xff for slots to include
- // and 0x0 for slots to ignore (it's currently the other way around). We
- // can achieve this by XOR'ing the result against our all-ones XMM register.
+ // We can do a fast-path test for no match here via _mm_testc_si128(),
+ // which is essentially equivalent to the following logic, just with
+ // fewer instructions:
//
-
- IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
-
- //
- // We're now ready to intersect the two XMM registers to determine which
- // slots should still be included in the comparison (i.e. which slots have
- // the exact same unique character as the string and a length less than or
- // equal to the length of the search string).
+ // IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+ // IncludeSlotsByUniqueChar);
//
-
- IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
- IncludeSlotsByLength);
-
+ // if (!IncludeSlots) {
+ // return NO_MATCH_FOUND;
+ // }
//
- // Generate a mask.
//
- Bitmap = _mm_movemask_epi8(IncludeSlots);
-
- if (!Bitmap) {
+ if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {
//
- // No bits were set, so there are no strings in this table starting
- // with the same character and of a lesser or equal length as the
- // search string.
+ // No remaining slots were left after we intersected the slots with
+ // matching unique characters with the inverted slots to ignore due
+ // to length. Thus, no prefix match was found.
//
return NO_MATCH_FOUND;
}
//
+ // Continue with the remaining logic, including actually generating the
+ // IncludeSlots, which we need for bitmap generation as part of our
+ // comparison loop.
+ //
+ // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+ // at the moment (we want 0xff for slots to include, and 0x00 for slots
+ // to ignore; it's currently the other way around), we use _mm_andnot_si128
+ // instead of just _mm_and_si128.
+ //
+
+ IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+ IncludeSlotsByUniqueChar);
+
+ //
+ // Generate a mask, count the number of bits, and initialize the search
+ // length.
+ //
+
+ Bitmap = _mm_movemask_epi8(IncludeSlots);
+
+ //
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_12(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This version is based off version 10, but factors in the improvements
made to version 4 of the x64 assembly version, thanks to suggestions from
both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).
Like version 11, we omit the vpxor to invert the lengths, but instead of
an initial vpandn, we leverage the fact that vptest sets the carry flag
if all 0s result from the expression: "(not param1) and param2". This
allows us to do a fast-path early exit (like x64 version 2 does) if no
match is found.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Count;
ULONG Length;
ULONG Index;
ULONG Shift = 0;
ULONG CharactersMatched;
ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlots;
//
// Unconditionally do the following five operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register, and load
// the string table's slot lengths array into a second XMM register.
//
// 4. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 5. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 3. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we do the "and not" intersection with the include slots next.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// We can do a fast-path test for no match here via _mm_testc_si128(),
// which is essentially equivalent to the following logic, just with
// fewer instructions:
//
// IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
// IncludeSlotsByUniqueChar);
//
// if (!IncludeSlots) {
// return NO_MATCH_FOUND;
// }
//
//
if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {
//
// No remaining slots were left after we intersected the slots with
// matching unique characters with the inverted slots to ignore due
// to length. Thus, no prefix match was found.
//
return NO_MATCH_FOUND;
}
//
// Continue with the remaining logic, including actually generating the
// IncludeSlots, which we need for bitmap generation as part of our
// comparison loop.
//
// As the IgnoreSlotsByLength XMM register is the inverse of what we want
// at the moment (we want 0xff for slots to include, and 0x00 for slots
// to ignore; it's currently the other way around), we use _mm_andnot_si128
// instead of just _mm_and_si128.
//
IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
IncludeSlotsByUniqueChar);
//
// Generate a mask, count the number of bits, and initialize the search
// length.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//
SearchLength = min(String->Length, 16);
//
// A popcount against the mask will tell us how many slots we matched, and
// thus, need to compare.
//
Count = __popcnt(Bitmap);
do {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap and adding the amount we've already shifted by.
//
NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
Index = NumberOfTrailingZeros + Shift;
//
// Shift the bitmap right, past the zeros and the 1 that was just found,
// such that it's positioned correctly for the next loop's tzcnt. Update
// the shift count accordingly.
//
Bitmap >>= (NumberOfTrailingZeros + 1);
Shift = Index + 1;
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched < Length && Length <= 16) {
//
// The slot length is longer than the number of characters matched
// from the search string; this isn't a prefix match. Continue.
//
continue;
}
if (Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
}
}
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
} while (--Count);
//
// If we get here, we didn't find a match.
//
//IACA_VC_END();
return NO_MATCH_FOUND;
}

Eh, there's not much in this one. The negative match fast path is basically
identical, and the normal prefix matches are a tiny bit slower.

IsPrefixOfStringInTable_13

Another tip from Fabian: we can tweak the loop logic further. Instead of
shifting the bitmap right each iteration (and keeping a separate shift count),
we can just leverage the blsr intrinsic (_blsr_u32), which stands for reset
lowest set bit, and is equivalent to doing x & (x - 1). This allows us to tweak
the loop organization as well, such that we can simply do while (Bitmap) { }
instead of the do { } while (--Count) approach we've been using.
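
As a rough side-by-side of the two loop shapes, here's a portable sketch that
walks the set bits of a 16-bit bitmap both ways and records the indices it
visits. The helper names are invented for this example, and the plain C loops
stand in for tzcnt and popcnt so that it compiles without any BMI flags.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for tzcnt; only valid for x != 0. */
unsigned count_trailing_zeros(uint32_t x)
{
    unsigned n = 0;
    assert(x != 0);
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/*
 * Older shape: popcount up front, then shift past each bit we find while
 * tracking how far we've shifted. Assumes a 16-bit bitmap (as produced by a
 * 16-byte movemask), so the shift amount never reaches 32.
 */
unsigned visit_bits_shift(uint32_t bitmap, unsigned *out)
{
    unsigned found = 0, shift = 0, count = 0;
    uint32_t b;

    for (b = bitmap; b; b &= b - 1) {
        count++;                        /* stand-in for popcnt */
    }

    while (count--) {
        unsigned zeros = count_trailing_zeros(bitmap);
        unsigned index = zeros + shift;
        bitmap >>= (zeros + 1);
        shift = index + 1;
        out[found++] = index;
    }
    return found;
}

/* Newer shape: no count, no shift; just clear the lowest set bit each time. */
unsigned visit_bits_blsr(uint32_t bitmap, unsigned *out)
{
    unsigned found = 0;

    while (bitmap) {
        out[found++] = count_trailing_zeros(bitmap);
        bitmap &= bitmap - 1;           /* what blsr computes */
    }
    return found;
}
```

Both functions visit the same indices in the same order; the second one just
carries less state around the loop.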

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_13(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This version is based off version 10, but does away with the bitmap
shifting logic and `do { } while (--Count)` loop, instead simply using
blsr in conjunction with `while (Bitmap) { }`. Credit goes to Fabian
Giesen (@rygorous) for pointing this approach out.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Length;
ULONG Index;
ULONG CharactersMatched;
ULONG SearchLength;
PSTRING TargetString;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
//
// Unconditionally do the following five operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register, and load
// the string table's slot lengths array into a second XMM register.
//
// 4. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 5. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 3. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all five of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// Invert the result of the comparison; we want 0xff for slots to include
// and 0x0 for slots to ignore (it's currently the other way around). We
// can achieve this by XOR'ing the result against our all-ones XMM register.
//
IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
//
// We're now ready to intersect the two XMM registers to determine which
// slots should still be included in the comparison (i.e. which slots have
// the exact same unique character as the string and a length less than or
// equal to the length of the search string).
//
IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
IncludeSlotsByLength);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
if (!Bitmap) {
//
// No bits were set, so there are no strings in this table starting
// with the same character and of a lesser or equal length as the
// search string.
//
return NO_MATCH_FOUND;
}
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//
SearchLength = min(String->Length, 16);
while (Bitmap) {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap.
//
Index = _tzcnt_u32(Bitmap);
//
// Clear the bitmap's lowest set bit, such that it's ready for the next
// loop's tzcnt if no match is found in this iteration. Equivalent to
//
// Bitmap &= Bitmap - 1;
//
// (Which the optimizer will convert into a blsr instruction anyway in
// non-debug builds. But it's nice to be explicit.)
//
Bitmap = _blsr_u32(Bitmap);
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched < Length && Length <= 16) {
//
// The slot length is longer than the number of characters matched
// from the search string; this isn't a prefix match. Continue.
//
continue;
}
if (Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
}
}
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// If we get here, we didn't find a match.
//
//IACA_VC_END();
return NO_MATCH_FOUND;
}

I like this change. It was a great suggestion from Fabian. Let's see how it
performs. Hopefully it'll do slightly better at prefix matching, given that
we're effectively reducing the number of instructions required as part of the
string comparison logic.
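
For reference, the string comparison logic inside the loop can be modeled in
scalar C roughly as follows. This is a sketch under a couple of assumptions:
slot buffers are zero-padded out to 16 bytes, the search string has no NUL
bytes within its length, and the slot is no longer than 16 characters. The
function names are hypothetical, and the plain loops stand in for the
cmpeq/movemask, bzhi and popcnt steps.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for cmpeq + movemask: bit i is set when byte i matches. */
uint32_t equal_bitmap(const unsigned char slot[16],
                      const unsigned char search[16])
{
    uint32_t bits = 0;
    unsigned i;

    for (i = 0; i < 16; i++) {
        if (slot[i] == search[i]) {
            bits |= 1u << i;
        }
    }
    return bits;
}

/* Stand-in for bzhi: keep only the low n bits (n <= 16 in this model). */
uint32_t keep_low_bits(uint32_t x, unsigned n)
{
    assert(n <= 16);
    return x & ((1u << n) - 1u);
}

/* Stand-in for popcnt. */
unsigned popcount32(uint32_t x)
{
    unsigned n = 0;

    for (; x; x &= x - 1) {
        n++;
    }
    return n;
}

/* Returns 1 when the slot (slot_length <= 16) is a prefix of the search. */
int slot_is_prefix(const unsigned char slot[16], unsigned slot_length,
                   const unsigned char search[16], unsigned search_length)
{
    unsigned capped = search_length < 16 ? search_length : 16;
    uint32_t mask = keep_low_bits(equal_bitmap(slot, search), capped);

    return popcount32(mask) >= slot_length;
}
```

In this model the masking step also throws away positions where both buffers
hold zero padding, which would otherwise be counted as matching characters.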

IsPrefixOfStringInTable_14

Let's give the C version the same chance as the assembly version with regard to
negative matching; we'll take version 13 above and factor in the vptest logic
from version 12.
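
Both of those versions share the same front end: work out which slots are still
candidates, and bail early if there are none. Here's a scalar sketch of that
filtering step, going by the comments in the listings. The candidate_slots name
and the plain-array parameters are assumptions made for this example (the real
routine works on XMM registers), and the index lookup only models the low four
bits of each shuffle index.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_COUNT 16
#define EMPTY_SLOT_LENGTH 0x7f  /* default length for empty slots */

/*
 * Returns a bitmap with bit i set when slot i still needs a full comparison.
 * A result of zero corresponds to the fast-path "no match" exit.
 */
uint32_t candidate_slots(const unsigned char unique_chars[SLOT_COUNT],
                         const unsigned char unique_index[SLOT_COUNT],
                         const signed char slot_lengths[SLOT_COUNT],
                         const unsigned char search[16],
                         unsigned search_length)
{
    uint32_t bitmap = 0;
    unsigned i;

    /* The byte compare is signed, so this model keeps lengths below 128. */
    assert(search_length <= 127);

    for (i = 0; i < SLOT_COUNT; i++) {
        /* Shuffle step: pick the search byte at this slot's unique index. */
        unsigned char c = search[unique_index[i] & 0x0f];
        int include_by_char = (c == unique_chars[i]);
        int ignore_by_length = (slot_lengths[i] > (signed char)search_length);

        if (include_by_char && !ignore_by_length) {
            bitmap |= 1u << i;
        }
    }
    return bitmap;
}
```

Because empty slots carry a length of 0x7f, they always land in the ignored
set here, even when their (zero) unique character happens to equal the search
byte.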

Diff (14 vs 13)

Diff (14 vs 12)

Full

% diff -u IsPrefixOfStringInTable_13.c IsPrefixOfStringInTable_14.c
--- IsPrefixOfStringInTable_13.c 2018-04-26 19:16:34.926170200 -0400
+++ IsPrefixOfStringInTable_14.c 2018-04-26 19:32:30.674199200 -0400
@@ -19,7 +19,7 @@
_Use_decl_annotations_
STRING_TABLE_INDEX
-IsPrefixOfStringInTable_13(
+IsPrefixOfStringInTable_14(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
@@ -32,10 +32,8 @@
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
- This version is based off version 10, but does away with the bitmap
- shifting logic and `do { } while (--Count)` loop, instead simply using
- blsr in conjunction with `while (Bitmap) { }`. Credit goes to Fabian
- Giesen (@rygorous) for pointing this approach out.
+ This version combines the altered bitmap logic from version 13 with the
+ fast-path _mm_testc_si128() exit from version 12.
Arguments:
@@ -70,9 +68,7 @@
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
- XMMWORD IncludeSlotsByLength;
XMMWORD IncludeSlots;
- const XMMWORD AllOnesXmm = _mm_set1_epi8(0xff);
//
// Unconditionally do the following five operations before checking any of
@@ -164,22 +160,43 @@
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
- // Invert the result of the comparison; we want 0xff for slots to include
- // and 0x0 for slots to ignore (it's currently the other way around). We
- // can achieve this by XOR'ing the result against our all-ones XMM register.
+ // We can do a fast-path test for no match here via _mm_testc_si128(),
+ // which is essentially equivalent to the following logic, just with
+ // fewer instructions:
//
+ // IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+ // IncludeSlotsByUniqueChar);
+ //
+ // if (!IncludeSlots) {
+ // return NO_MATCH_FOUND;
+ // }
+ //
+ //
+
+ if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {
- IncludeSlotsByLength = _mm_xor_si128(IgnoreSlotsByLength, AllOnesXmm);
+ //
+ // No remaining slots were left after we intersected the slots with
+ // matching unique characters with the inverted slots to ignore due
+ // to length. Thus, no prefix match was found.
+ //
+
+ return NO_MATCH_FOUND;
+ }
//
- // We're now ready to intersect the two XMM registers to determine which
- // slots should still be included in the comparison (i.e. which slots have
- // the exact same unique character as the string and a length less than or
- // equal to the length of the search string).
+ // Continue with the remaining logic, including actually generating the
+ // IncludeSlots, which we need for bitmap generation as part of our
+ // comparison loop.
+ //
+ // As the IgnoreSlotsByLength XMM register is the inverse of what we want
+ // at the moment (we want 0xff for slots to include, and 0x00 for slots
+ // to ignore; it's currently the other way around), we use _mm_andnot_si128
+ // instead of just _mm_and_si128.
//
- IncludeSlots = _mm_and_si128(IncludeSlotsByUniqueChar,
- IncludeSlotsByLength);
+ IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
+ IncludeSlotsByUniqueChar);
//
// Generate a mask.
@@ -187,17 +204,6 @@
Bitmap = _mm_movemask_epi8(IncludeSlots);
- if (!Bitmap) {
-
- //
- // No bits were set, so there are no strings in this table starting
- // with the same character and of a lesser or equal length as the
- // search string.
- //
-
- return NO_MATCH_FOUND;
- }
-
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.

% diff -u IsPrefixOfStringInTable_12.c IsPrefixOfStringInTable_14.c
--- IsPrefixOfStringInTable_12.c 2018-04-26 17:47:54.970331600 -0400
+++ IsPrefixOfStringInTable_14.c 2018-04-26 19:32:30.674199200 -0400
@@ -19,7 +19,7 @@
_Use_decl_annotations_
STRING_TABLE_INDEX
-IsPrefixOfStringInTable_12(
+IsPrefixOfStringInTable_14(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
@@ -32,15 +32,8 @@
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
- This version is based off version 10, but factors in the improvements
- made to version 4 of the x64 assembly version, thanks to suggestions from
- both Wojciech Mula (@pshufb) and Fabian Giesen (@rygorous).
-
- Like version 11, we omit the vpxor to invert the lengths, but instead of
- an initial vpandn, we leverage the fact that vptest sets the carry flag
- if all 0s result from the expression: "(not param1) and param2". This
- allows us to do a fast-path early exit (like x64 version 2 does) if no
- match is found.
+ This version combines the altered bitmap logic from version 13 with the
+ fast-path _mm_testc_si128() exit from version 12.
Arguments:
@@ -61,12 +54,9 @@
{
ULONG Bitmap;
ULONG Mask;
- ULONG Count;
ULONG Length;
ULONG Index;
- ULONG Shift = 0;
ULONG CharactersMatched;
- ULONG NumberOfTrailingZeros;
ULONG SearchLength;
PSTRING TargetString;
STRING_SLOT Slot;
@@ -118,7 +108,7 @@
// Load the first 16-bytes of the search string into an XMM register.
//
- Search.CharsXmm = _mm_load_si128((PXMMWORD)String->Buffer);
+ Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
@@ -164,7 +154,7 @@
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
- // we do the "and not" intersection with the include slots next.
+ // we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
@@ -209,8 +199,7 @@
IncludeSlotsByUniqueChar);
//
- // Generate a mask, count the number of bits, and initialize the search
- // length.
+ // Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
@@ -222,31 +211,26 @@
SearchLength = min(String->Length, 16);
- //
- // A popcount against the mask will tell us how many slots we matched, and
- // thus, need to compare.
- //
-
- Count = __popcnt(Bitmap);
-
- do {
+ while (Bitmap) {
//
// Extract the next index by counting the number of trailing zeros left
- // in the bitmap and adding the amount we've already shifted by.
+ // in the bitmap.
//
- NumberOfTrailingZeros = _tzcnt_u32(Bitmap);
- Index = NumberOfTrailingZeros + Shift;
+ Index = _tzcnt_u32(Bitmap);
//
- // Shift the bitmap right, past the zeros and the 1 that was just found,
- // such that it's positioned correctly for the next loop's tzcnt. Update
- // the shift count accordingly.
+ // Clear the bitmap's lowest set bit, such that it's ready for the next
+ // loop's tzcnt if no match is found in this iteration. Equivalent to
+ //
+ // Bitmap &= Bitmap - 1;
+ //
+ // (Which the optimizer will convert into a blsr instruction anyway in
+ // non-debug builds. But it's nice to be explicit.)
//
- Bitmap >>= (NumberOfTrailingZeros + 1);
- Shift = Index + 1;
+ Bitmap = _blsr_u32(Bitmap);
//
// Load the slot and its length.
@@ -329,7 +313,7 @@
return (STRING_TABLE_INDEX)Index;
- } while (--Count);
+ }
//
// If we get here, we didn't find a match.

_Use_decl_annotations_
STRING_TABLE_INDEX
IsPrefixOfStringInTable_14(
PSTRING_TABLE StringTable,
PSTRING String,
PSTRING_MATCH Match
)
/*++
Routine Description:
Searches a string table to see if any strings "prefix match" the given
search string. That is, whether any string in the table "starts with
or is equal to" the search string.
This version combines the altered bitmap logic from version 13 with the
fast-path _mm_testc_si128() exit from version 12.
Arguments:
StringTable - Supplies a pointer to a STRING_TABLE struct.
String - Supplies a pointer to a STRING struct that contains the string to
search for.
Match - Optionally supplies a pointer to a variable that contains the
address of a STRING_MATCH structure. This will be populated with
additional details about the match if a non-NULL pointer is supplied.
Return Value:
Index of the prefix match if one was found, NO_MATCH_FOUND if not.
--*/
{
ULONG Bitmap;
ULONG Mask;
ULONG Length;
ULONG Index;
ULONG CharactersMatched;
ULONG SearchLength;
PSTRING TargetString;
STRING_SLOT Slot;
STRING_SLOT Search;
STRING_SLOT Compare;
SLOT_LENGTHS Lengths;
XMMWORD LengthXmm;
XMMWORD UniqueChar;
XMMWORD TableUniqueChars;
XMMWORD IncludeSlotsByUniqueChar;
XMMWORD IgnoreSlotsByLength;
XMMWORD IncludeSlots;
//
// Unconditionally do the following six operations before checking any of
// the results and determining how the search should proceed:
//
// 1. Load the search string into an Xmm register, and broadcast the
// character indicated by the unique character index (relative to
// other strings in the table) across a second Xmm register.
//
// 2. Load the string table's unique character array into an Xmm register.
//
// 3. Broadcast the search string's length into an XMM register.
//
// 4. Load the string table's slot lengths array into an XMM register.
//
// 5. Compare the unique character from step 1 to the string table's unique
// character array set up in step 2. The result of this comparison
// will produce an XMM register with each byte set to either 0xff if
// the unique character was found, or 0x0 if it wasn't.
//
// 6. Compare the search string's length from step 3 to the string table's
// slot length array set up in step 4. This allows us to identify the
// slots that have strings that are of lesser or equal length to our
// search string. As we're doing a prefix search, we can ignore any
// slots longer than our incoming search string.
//
// We do all six of these operations up front regardless of whether or not
// they're strictly necessary. That is, if the unique character isn't in
// the unique character array, we don't need to load array lengths -- and
// vice versa. However, we assume the benefits afforded by giving the CPU
// a bunch of independent things to do unconditionally up-front outweigh
// the cost of putting in branches and conditionally loading things if
// necessary.
//
//
// Load the first 16-bytes of the search string into an XMM register.
//
Search.CharsXmm = _mm_loadu_si128((PXMMWORD)String->Buffer);
//
// Broadcast the search string's unique characters according to the string
// table's unique character index.
//
UniqueChar = _mm_shuffle_epi8(Search.CharsXmm,
StringTable->UniqueIndex.IndexXmm);
//
// Load the slot length array into an XMM register.
//
Lengths.SlotsXmm = _mm_load_si128(&StringTable->Lengths.SlotsXmm);
//
// Load the string table's unique character array into an XMM register.
//
TableUniqueChars = _mm_load_si128(&StringTable->UniqueChars.CharsXmm);
//
// Broadcast the search string's length into an XMM register.
//
LengthXmm.m128i_u8[0] = (BYTE)String->Length;
LengthXmm = _mm_broadcastb_epi8(LengthXmm);
//
// Compare the search string's unique character with all of the unique
// characters of strings in the table, saving the results into an XMM
// register. This comparison will indicate which slots we can ignore
// because the characters at a given index don't match. Matched slots
// will be 0xff, unmatched slots will be 0x0.
//
IncludeSlotsByUniqueChar = _mm_cmpeq_epi8(UniqueChar, TableUniqueChars);
//
// Find all slots that are longer than the incoming string length, as these
// are the ones we're going to exclude from any prefix match.
//
// N.B. Because we default the length of empty slots to 0x7f, they will
// handily be included in the ignored set (i.e. their words will also
// be set to 0xff), which means they'll also get filtered out when
// we invert the mask shortly after.
//
IgnoreSlotsByLength = _mm_cmpgt_epi8(Lengths.SlotsXmm, LengthXmm);
//
// We can do a fast-path test for no match here via _mm_testc_si128(),
// which is essentially equivalent to the following logic, just with
// fewer instructions:
//
// IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
// IncludeSlotsByUniqueChar);
//
// if (!IncludeSlots) {
// return NO_MATCH_FOUND;
// }
//
//
if (_mm_testc_si128(IgnoreSlotsByLength, IncludeSlotsByUniqueChar)) {
//
// No remaining slots were left after we intersected the slots with
// matching unique characters with the inverted slots to ignore due
// to length. Thus, no prefix match was found.
//
return NO_MATCH_FOUND;
}
//
// Continue with the remaining logic, including actually generating the
// IncludeSlots, which we need for bitmap generation as part of our
// comparison loop.
//
// As the IgnoreSlotsByLength XMM register is the inverse of what we want
// at the moment (we want 0xff for slots to include, and 0x00 for slots
// to ignore; it's currently the other way around), we use _mm_andnot_si128
// instead of just _mm_and_si128.
//
IncludeSlots = _mm_andnot_si128(IgnoreSlotsByLength,
IncludeSlotsByUniqueChar);
//
// Generate a mask.
//
Bitmap = _mm_movemask_epi8(IncludeSlots);
//
// Calculate the "search length" of the incoming string, which ensures we
// only compare up to the first 16 characters.
//
SearchLength = min(String->Length, 16);
while (Bitmap) {
//
// Extract the next index by counting the number of trailing zeros left
// in the bitmap.
//
Index = _tzcnt_u32(Bitmap);
//
// Clear the bitmap's lowest set bit, such that it's ready for the next
// loop's tzcnt if no match is found in this iteration. Equivalent to
//
// Bitmap &= Bitmap - 1;
//
// (Which the optimizer will convert into a blsr instruction anyway in
// non-debug builds. But it's nice to be explicit.)
//
Bitmap = _blsr_u32(Bitmap);
//
// Load the slot and its length.
//
Slot.CharsXmm = _mm_load_si128(&StringTable->Slots[Index].CharsXmm);
Length = Lengths.Slots[Index];
//
// Compare the slot to the search string.
//
Compare.CharsXmm = _mm_cmpeq_epi8(Slot.CharsXmm, Search.CharsXmm);
//
// Create a mask of the comparison, then filter out high bits from the
// search string's length (which is capped at 16). (This shouldn't be
// technically necessary as the string array buffers should have been
// calloc'd and zeroed, but optimizing compilers can often ignore the
// zeroing request -- which can produce some bizarre results where the
// debug build is correct (because the buffers were zeroed) but the
// release build fails because the zeroing got ignored and there are
// junk bytes past the NULL terminator, which get picked up in our
// 128-bit loads.)
//
Mask = _bzhi_u32(_mm_movemask_epi8(Compare.CharsXmm), SearchLength);
//
// Count how many characters matched.
//
CharactersMatched = __popcnt(Mask);
if ((USHORT)CharactersMatched < Length && Length <= 16) {
//
// The slot length is longer than the number of characters matched
// from the search string; this isn't a prefix match. Continue.
//
continue;
}
if (Length > 16) {
//
// The first 16 characters in the string matched against this
// slot, and the slot is oversized (longer than 16 characters),
// so do a direct comparison between the remaining buffers.
//
TargetString = &StringTable->pStringArray->Strings[Index];
CharactersMatched = IsPrefixMatch(String, TargetString, 16);
if (CharactersMatched == NO_MATCH_FOUND) {
//
// The prefix match failed, continue our search.
//
continue;
}
}
//
// This slot is a prefix match. Fill out the Match structure if the
// caller provided a non-NULL pointer, then return the index of the
// match.
//
if (ARGUMENT_PRESENT(Match)) {
Match->Index = (BYTE)Index;
Match->NumberOfMatchedCharacters = (BYTE)CharactersMatched;
Match->String = &StringTable->pStringArray->Strings[Index];
}
return (STRING_TABLE_INDEX)Index;
}
//
// If we get here, we didn't find a match.
//
//IACA_VC_END();
return NO_MATCH_FOUND;
}
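To make the filtering steps in the routine above easier to follow, here is a minimal scalar sketch of what the SIMD instructions compute, one byte lane at a time. It is a model for illustration only, not the implementation used in this article: the ScalarTable type and scalar_candidate_bitmap() are hypothetical names, plain arrays stand in for XMM registers, and the sketch assumes search lengths below 127 and unique indexes below 16.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar stand-in for the three 16-byte arrays the routine loads. */
typedef struct {
    uint8_t unique_chars[16];  /* one distinguishing character per slot     */
    uint8_t unique_index[16];  /* byte offset of that character in the slot */
    int8_t  lengths[16];       /* slot lengths; empty slots default to 0x7f */
} ScalarTable;

/*
 * Returns a 16-bit bitmap with bit i set when slot i survives both filters:
 * its unique character matches and it is not longer than the search string.
 * A return value of 0 corresponds to the fast-path "no match" exit.
 */
static uint32_t
scalar_candidate_bitmap(const ScalarTable *table, const char *search, size_t len)
{
    uint8_t chars[16] = {0};
    uint32_t bitmap = 0;
    int8_t search_len = (int8_t)len;   /* sketch assumes len < 127 */

    memcpy(chars, search, len < 16 ? len : 16);

    for (int i = 0; i < 16; i++) {
        /* Shuffle step: pick the search byte at this slot's unique index. */
        uint8_t unique = chars[table->unique_index[i] & 0x0f];

        /* Byte equality compare: 0xff when the unique characters agree. */
        uint8_t include = (unique == table->unique_chars[i]) ? 0xff : 0x00;

        /* Signed greater-than compare: 0xff when the slot is too long. */
        uint8_t ignore = (table->lengths[i] > search_len) ? 0xff : 0x00;

        /* "and not" plus movemask: keep the lane's high bit if it survives. */
        if ((uint8_t)(~ignore & include) & 0x80) {
            bitmap |= 1u << i;
        }
    }

    return bitmap;
}
```

Because empty slots carry a length of 0x7f, they always land in the ignored set in this model, which mirrors the N.B. comment in the routine.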

We're obviously clutching at straws here in trying to eke out more
performance. The _mm_testc_si128() alteration was a tiny bit slower for
version 12 across the board. However, version 4 of our assembly, which uses
vptest (the underlying assembly instruction that the _mm_testc_si128()
intrinsic maps to), was definitely a little bit faster than the other versions.
Let's see how our final C version performs:

Welp, at least it's consistent! Like version 12, the _mm_testc_si128()
change doesn't really offer a compelling improvement for version 14. That makes
version 13 officially our fastest C implementation for round 2.
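The two ways of walking the bitmap that appear in these versions (shifting past each set bit while tracking the shift amount, versus clearing the lowest set bit) can be compared in a small portable sketch. The function names are hypothetical, a loop-based count_trailing_zeros() stands in for _tzcnt_u32(), the expression bitmap & (bitmap - 1) stands in for _blsr_u32(), and the shifting variant leaves out the popcount-driven loop counter the earlier versions used.

```c
#include <stdint.h>

/* Portable stand-in for _tzcnt_u32(); callers guarantee a non-zero input. */
static unsigned
count_trailing_zeros(uint32_t value)
{
    unsigned count = 0;
    while ((value & 1u) == 0) {
        value >>= 1;
        count++;
    }
    return count;
}

/* Earlier approach: shift past each set bit and track the total shift. */
static int
indexes_by_shifting(uint32_t bitmap, unsigned out[16])
{
    int found = 0;
    unsigned shift = 0;

    bitmap &= 0xffffu;   /* 16 slots only; also keeps every shift below 32 */
    while (bitmap) {
        unsigned zeros = count_trailing_zeros(bitmap);
        unsigned index = zeros + shift;
        bitmap >>= (zeros + 1);
        shift = index + 1;
        out[found++] = index;
    }
    return found;
}

/* Later approach: clear the lowest set bit each time around the loop. */
static int
indexes_by_clearing(uint32_t bitmap, unsigned out[16])
{
    int found = 0;

    bitmap &= 0xffffu;
    while (bitmap) {
        out[found++] = count_trailing_zeros(bitmap);
        bitmap &= bitmap - 1;
    }
    return found;
}
```

Both functions visit the same slot indexes in the same order; the second simply needs fewer live variables, since the index no longer depends on a running shift count.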

IsPrefixOfStringInTable_x64_5

Before we conclude round 2, let's see if we can eke any more performance out of
the negative match fast path of our fastest assembly version so far: version 4.
For this step, I'm going to leverage the
Intel Architecture Code Analyzer, or IACA for short.

This is a handy little static analysis tool that can provide useful information
for fine-tuning performance sensitive code. Let's take a look at the output
from IACA for our assembly version 4. To do this, I uncomment the two macros,
IACA_VC_START and IACA_VC_END, which reside at the
start and end of the negative match logic. These macros are defined in
StringTable.inc,
and look like this:

You may have noticed commented-out versions of these macros in both the C and
assembly code. What they do is emit a specific byte pattern in the instruction
byte code that the IACA tool can detect. You place the start and end markers
around the code you're interested in, recompile it, then run IACA against the
final executable (or library).

Let's see what happens when we do this for our version 4 assembly routine. I'll
include the relevant assembly snippet, reformatted into a more concise fashion,
followed by the IACA output (also reformatted into a more concise fashion):

The
Intel Architecture Code Analyzer User Manual (v3.0) provides decent
documentation about how to interpret the output, so I won't go over the gory
details. What I'm really looking at in this pass is what my block throughput
is, and potentially what the bottleneck is.

In this case, our block throughput is being reported as 3.74 cycles, which
basically indicates how many CPU cycles it takes to execute the block. Our
bottleneck is dependency chains, which refers to the situation where, say,
instruction C can't start because the results from instruction A aren't
ready yet. (This... this is a vastly simplified explanation.)
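As a rough illustration of what a dependency chain looks like in source form, consider the two summation routines below. This is a generic sketch with hypothetical function names, unrelated to the string table code, and an optimizing compiler is free to transform either loop, so treat it as a picture of the idea rather than a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* One long chain: every addition needs the result of the previous one. */
static uint64_t
sum_single_chain(const uint32_t *values, size_t count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += values[i];
    }
    return total;
}

/* Four shorter chains that only depend on each other at the very end. */
static uint64_t
sum_four_chains(const uint32_t *values, size_t count)
{
    uint64_t a = 0, b = 0, c = 0, d = 0;
    size_t i = 0;

    for (; i + 4 <= count; i += 4) {
        a += values[i];
        b += values[i + 1];
        c += values[i + 2];
        d += values[i + 3];
    }
    for (; i < count; i++) {   /* leftover elements */
        a += values[i];
    }
    return a + b + c + d;
}
```

Both routines return the same sum. In the second, the additions into a, b, c and d within one iteration do not wait on each other, which is the kind of independence an out-of-order core can exploit.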

Alright, well, what can we do? A good answer would be that with an intimate
understanding of contemporary Intel CPU architecture, you can pin-point exactly
what needs changing in order to reduce dependencies, and maximise port
utilization, and leverage macro fusion, but also not forgetting about
microfusion, and remembering microcode latencies, and generally become one with
the Intel optimization manual, but never at the expense of under-utilizing
your front-back µop-frobulator, unless the inverted cache re-up and
re-vigor policy is in its hybrid coalesced L9 state, at which point any increase
in thermal unit pivot-bracketing will nullify all efforts across product lines 4
and 5 but exponentially accelerate entropy discount licensing for models 6 and
7, but only after recent microcode patches if you're in the Northern hemisphere
during an unseasonably cold Spring.

Or you can just move shit around until the number gets smaller. That's what I did.

Well, that's not entirely true. Fabian did make a good point when he was
reviewing some of my assembly: I was often needlessly doing a load into an
XMM register only to use it once in a subsequent operation. Instead of doing
that, I could just use the load-op version of the instruction, which
allows an input operand to be sourced directly from memory.

But yeah, other than a few load-op tweaks, I basically just shuffled shit around
until the block throughput reported lower. Very rigorous methodology, I know.
Here's the final version, which also happens to be the version quoted in the
introduction of this article:

As you can see, that is reporting a block throughput of 3.48 instead of 3.74. A
whopping 0.26 reduction! Also note the bottleneck is now being reported as
FrontEnd, which basically means that the thing holding up this code
now is literally the CPU's ability to decode the actual instruction stream into
actionable internal work. (Again, super simplistic explanation of a very complex
process.)

For the sake of completeness, here's the proper diff and full version of
assembly version 5:

Reviewing IsPrefixOfStringInTable_x64_3...

What immediately stands out to me with those results is how
everything seems to be impacted; it's not just the prefix
matching performance that's bad, it's the negative match performance as well.
This is odd, as we didn't really change anything in the negative match logic.

Except for that pesky prologue we added to stash the values of rsi,
rdi, and the flags register. Hmmm! That seems like as good a place
as any to start investigating. Let's whip up another version that defers the
prologue until after the initial negative match logic. This
exploits a little detail regarding prologues in that they need to appear in the
first 255 bytes of the function byte code — but don't necessarily need to
appear at the very start. As long as the prologue definition for the register
is the first time the register is mutated, you've got a bit of room to play with
regarding where to actually put it.

So, here's version 7 of the routine, based off version 3, that simply relocates
the prologue code to appear after the initial negative match logic:

As you can see, the prolog value has changed to 0x4c,
and the offsets for each entry have also changed accordingly. Let's disassemble
the function and see if we can correlate the addresses of our prologue
instructions to the offsets indicated above:

All of the addresses share 0x00007fff`f8424 as the first 13 digits,
so we can ignore that part to simplify the values we're working with. Let's
take a look at the first of our prologue instructions, sub rsp,
20h. This maps to our alloc_stack LOCALS_SIZE line:

The sub rsp, 20h line appears at address offset 0x57d.
If we subtract the address of the very first instruction, 0x540,
from that, we get 61, or 0x3d in hex.

Hmmmm. That doesn't map to any of the offsets that appear in version 7's
runtime function entry. Let's try the address of the pushfq
instruction, which is at offset 0x581. If we subtract the
start address 0x540 from that, we're left with 65, which in hex
is, drum roll, 0x41! That matches the last line of the
runtime function entry:

That makes sense if we think about the purpose of the unwind entries. They are
there for the kernel to compare against a faulting instruction's address (i.e.
the value contained in the RIP register at the time of the fault) in order to
determine what needs to be unwound as part of exception handling. In this case,
at byte offset 0x41, the sub rsp, 20h instruction will
have already been executed, so the kernel knows it needs to unwind this (e.g. by
doing what will effectively equate to add rsp, 20h) within the
exception handling logic when it needs to unwind the entire frame and restore
all of the non-volatile registers.

If we take a look at the first instruction after
our last prologue instruction, vpextrb r11d, xmm4, 0, it resides at
address offset 0x58c. Subtracting the start address
0x540 from that, we get 76, which is 0x4c in hex,
which matches the offset of the last unwind entry, as well as the prologue
end point:

The reason that the prologue must occur within the first 255 bytes of the
function is simply due to the fact that the prologue size and offsets are
stored using a single byte, so 255 is the maximum value that can be represented.
When writing a NESTED_ENTRY with MASM, you need to have the
END_PROLOGUE macro (which expands to .endprolog)
occur within the first 255 bytes of your function.
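The offset arithmetic and the single-byte limit can be sketched in C as follows. The two structures are simplified stand-ins modeled on the documented x64 unwind data layout (the real definitions pack several fields into bit-fields, which are collapsed into whole bytes here), and the Sketch* and prolog_offset() names are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-in for one unwind code entry. */
typedef struct {
    uint8_t code_offset;   /* offset just past the prologue instruction */
    uint8_t op_and_info;   /* 4-bit operation code plus 4-bit operand   */
} SketchUnwindCode;

/* Simplified stand-in for the fixed part of the unwind info header. */
typedef struct {
    uint8_t version_and_flags;
    uint8_t size_of_prolog;     /* one byte, hence the 255 byte ceiling */
    uint8_t count_of_codes;
    uint8_t frame_register_and_offset;
} SketchUnwindHeader;

/*
 * Computes the offset of an address from the start of its function.
 * Returns 1 and stores the result if it fits in a single byte, else 0.
 */
static int
prolog_offset(uint32_t function_start, uint32_t address, uint8_t *offset)
{
    uint32_t delta;

    if (address < function_start) {
        return 0;
    }
    delta = address - function_start;
    if (delta > UINT8_MAX) {
        return 0;
    }
    *offset = (uint8_t)delta;
    return 1;
}
```

Feeding in the addresses from the disassembly above, prolog_offset(0x540, 0x581, &offset) yields 0x41 and prolog_offset(0x540, 0x58c, &offset) yields 0x4c, while anything 256 bytes or more past the start is rejected.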

If we move the END_PROLOGUE line in version 7 way down to the
bottom of the routine and try and compile, MASM balks:

(Note: I have no idea why the spelling of prolog vs prologue and epilog vs
epilogue is so inconsistent within the Microsoft tooling and docs.)

Let's get back on track. We need to review the performance of version 7 to see
if relocating the prologue has any impact on the negative matching performance
of the routine. If it does, this is a strong indicator that it's at fault,
especially if the prefix matching still shows the same performance issues.
Here's the comparison:

Hah! Look at that, the negative match performance is back on par with version
2. So, the blame now squarely points to something peculiar in the prologue
inducing a huge (well, relatively huge) performance hit. But the prologue is so
simple! It's only pushing flags, and two registers!

IsPrefixOfStringInTable_x64_8

I know register pushing is cheap. Borderline free in the grand scheme of
things. Flags though. Flags are an interesting one. The bane of the out-of-order
CPU pipeline, they could very well be forcing a synchronization point within the
code, preventing all the contemporary goodies you get when you let the CPU do
its thing whenever it wants, rather than when you need. (Goodies like...
Meltdown!)

Let's test the theory. We'll take version 7 and simply comment-out the flag
pushing and popping behavior.

(Technically we're not allowed to do that; the direction indicator is classed as
non-volatile; if the calling function has it set to reverse, and on return,
we've set it to forward, things are going to be problematic if it actually
wanted it set to reverse. In practice, this isn't that common.
At least with our current stack, what with our aversion to even using a C
runtime library, we know nothing in our benchmark environment is going to be
faced with that predicament.)

Crikey! Flags were clearly at fault! Not only that, but look at the
performance of the routine in comparison with version 2 for prefix matching;
there's a definite improvement in performance! (I also looked up the latency of
pushfq: 9 cycles! I had no idea it was that expensive.)

....wait wait wait. Shut the front door! This new assembly version is nearly
as fast as the fastest C versions, and it doesn't even have the optimized
negative match re-work in place. Plot twist!

Either way, it means we might be able to wrangle an assembly version
that can dominate the negative matching fast path and give the
C version a run for its money with prefix matching, which would be a great way
to end the article! Let's give it a shot.

(Note: this is the first point in the article where I'm not retroactively
documenting what I've done — it's all live! I have no idea if I'll
be able to produce a final assembly version that's competitive with C in
all aspects. Then again, I'm persistent and stubborn, so who knows.)

We'll do this in a couple of pieces. First, we'll convert version 8 (which has
version 3's logic) into a LEAF_ENTRY and restore the byte-by-byte
comparison logic instead of repe cmpsb, but keep everything else
identical. This will be version 9. For version 10, we can tidy up version 9 a
bit and replace some of the jumps to the epilogue area (Pfx90) with
a simple ret where applicable.

From there, we'll make version 11, which will combine version 10 and the
optimized negative match logic we established in the assembly version 5.
After that, we can use versions 12 onward to try to replicate the superior
inner loop approach identified by Fabian that led to the C routine
IsPrefixOfStringInTable_13. And to
think we were almost going to publish this article without investigating the
slowdown associated with version 3 of the assembly!

IsPrefixOfStringInTable_x64_9

As mentioned, let's take the version 8 NESTED_ENTRY and convert
it into a LEAF_ENTRY with the least amount of code churn possible.
As version 8 is essentially version 3 with a relocated prologue and the
push_eflags/popfq bits commented out, I'll provide a
diff against version 3 as well.

Let's review performance. I'll omit the C versions from the graphs for now
whilst we focus on optimizing the assembly versions. In this next comparison,
we want to verify that we're still seeing the performance gains we saw in
version 8 in versions 9 and 10. If the timings for version 9 and 10 differ,
I'd expect version 10 to be better — but it won't be by much.

(Note: I had to generate new CSV files for these graphs, as the old ones
didn't have any timings for these new functions we've added. It's easier
to just regenerate timings for everything, versus trying to splice in the
new timings into the old files. So, there will be small differences in
the numbers you see here for old routines referenced earlier (i.e. the
timings for assembly versions 2, 4, 5 and 8 aren't identical to earlier
graphs). The differences are negligible (a handful of cycles per 1000
iterations). I'll put the GitHub URL of the corresponding source data
used to generate each graph herein. They'll all live within
this
directory, which contains all the source for everything in this
article.)

Excellent! Version 10 is a tiny bit faster than 9, but both retain the speed
advantages we saw from version 8. We can also see how expensive the setup cost
is for repe cmpsb, too, which version 8 used. It's not necessarily
a fair comparison, as only one byte is being compared ($INDEX_ALLOCATION is 17
bytes long, so we're only comparing the final letter, N), and there's a fixed
overhead with the repe cmp/stos/lods-type instructions that can't
be avoided. (They can prove optimal for longer sequences, though.)

IsPrefixOfStringInTable_x64_11

Let's take version 10 and blend in the optimal negative match instruction
ordering we used for version 5. (Version 10 is essentially derived from
version 3, and we wrote that before we'd come up with the optimizations
explored in versions 4 and 5.)

Notice the similarity between the diff above and the one for
IsPrefixOfStringInTable_x64_5.
Let's see how the performance compares. The negative match performance for
version 11 should be on par with version 5.

We have a new winner! Version 11 is now the fastest assembly version across the
board for both prefix and negative matching. Before we start our final pass on
version 12, let's take a quick look at how we currently compare against the
fastest C version:

It's already very close! We just need to shave off a few more cycles on the
assembly version to take the crown.

IsPrefixOfStringInTable_x64_12

Let's start with updating the main loop logic such that it matches
IsPrefixOfStringInTable_13. We'll
omit the bitmap shifting and loop count in lieu of the blsr
approach.

(About 5 hours pass...)

Alright, I'm back! Version 12 of our assembly routine is complete! This was
the first major change to the routine since version 2, really, and I had the
benefit of the past ~220 hours already spent obsessing over this topic, so, I'm
actually pretty happy with the result! Let's take a look. (The diff view of
this version is pretty messy compared to the others, given the increased amount
of code churn that was involved.)

I'm really happy with how that turned out! Switching to blsr
really improved the layout of the inner loop, and vastly reduced our register
pressure, which means less XMM register spilling is required, which is always a
good thing.

But does it improve performance? Eek! It's our final Hail Mary attempt at an
improvement. Can we beat the fastest profile-guided optimization build of the C
version in both prefix matching and negative matching?

*Drum roll*

(This page doesn't have any ads. But if it did, I'd totally put them here.
All sneaky like, just as the article gets interesting.)

The performance for version 12 of the assembly is...

F$#*@%ing ey, look at that! :-)

The assembly version brings in gold across the board! Hot damn! A quick run
through VTune suggests the routine is clocking in with a CPI of 0.266, which is
pretty darn close to the theoretical best of 0.25 (which implies 4
instructions retired per clock cycle).

Other Applications

Once I'd written the first version of the StringTable component, for better or
worse, it became the hammer for all of my string-related problems! My favorite
example of this is the code I wrote for parsing the output of Windows debug
engine's examine symbols command.

Here's an example of a few lines of output from the cdb command
x /v /t Rtl!*:

The function
ExamineSymbolsParseLine is called for each line of output and is responsible
for parsing it into a
DEBUG_ENGINE_EXAMINED_SYMBOL structure. It's some good ol' fashioned string
processing using nothing but pointer arithmetic and a bunch of string tables.

It was the first time I needed to match more than 16 strings in a given category,
though. A pattern emerged that was quite reasonable, and it became my de facto
way of dealing with multiple string tables for a given category.

Let's look at the basic type category. Two string tables were
constructed from the following constant delimited strings (view on GitHub):

//
// The order of these enumeration symbols must match the exact order of the
// corresponding string in the relevant ExamineSymbolsBasicTypes[1..n] STRING
// structure (see DebugEngineConstants.c). This is because string tables are
// created from the delimited strings and the match index is cast directly to
// an enum of this type.
//
typedef enum _DEBUG_ENGINE_EXAMINE_SYMBOLS_TYPE {
UnknownType = -1,
//
// First 16 types captured by BasicTypeStringTable1.
//
NoType = 0,
FunctionType,
CharType,
WideCharType,
ShortType,
LongType,
Integer64Type,
IntegerType,
UnsignedCharType,
UnsignedWideCharType,
UnsignedShortType,
UnsignedLongType,
UnsignedInteger64Type,
UnsignedIntegerType,
UnionType,
StructType,
//
// Next 16 types captured by BasicTypeStringTable2.
//
CLRType = 16,
BoolType,
VoidType,
ClassType,
FloatType,
DoubleType,
SALExecutionContextType,
ENativeStartupStateType,
//
// Any types that don't map directly to literal type names extracted from
// the output string are listed here. The first one starts at 48 in order
// to differentiate it from the string tables.
//
//
// Call site of an inline function.
//
InlineCallerType = 48,
//
// Enum is special in that it doesn't map to a string in the string table;
// if a type can't be inferred from the list above, it defaults to Enum.
//
EnumType,
//
// Any enumeration value >= InvalidType is invalid. Make sure this always
// comes last in the enum layout.
//
InvalidType
} DEBUG_ENGINE_EXAMINE_SYMBOLS_TYPE;

Here's the part of the logic within
ExamineSymbolsParseLine that deals with matching the basic type
part of the line. This refers to the 5th column of the output, e.g. the
struct,
char *[181],
<function>,
<CLR type>
bits in the following output:

If there's no match found, we check to see if we've performed the maximum number
of attempts, that is, whether or not we've exhausted all our string tables. If
we have, we just default to the EnumType.

Otherwise, bump the StringTable pointer (which relies on the fact that the
underlying string table pointers in the session structure are contiguous —
a handy implementation detail), bump the match offset by the number of entries
per string table, and try the match again.

If we found a match, we can obtain the SymbolType enum representation of the
underlying match by simply adding the match index to the match offset. I like
that. It's simple and fast. It also plays nicely with switch statements; do
your lookup, resolve the underlying enum value, and process each possible path
in a case statement like you'd do with any other integer representation of an
option.
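The retry pattern described above can be sketched in portable C. This is a simplified model rather than the code from ExamineSymbolsParseLine: SketchTable, sketch_prefix_match() and sketch_resolve_type() are hypothetical names, and a plain byte-by-byte matcher stands in for the real string table lookup.

```c
#include <string.h>

#define ENTRIES_PER_TABLE 16
#define NO_MATCH (-1)

/* Hypothetical stand-in for a string table: up to 16 entries, NULL-padded. */
typedef struct {
    const char *strings[ENTRIES_PER_TABLE];
} SketchTable;

/*
 * Scalar stand-in for the prefix match routine: returns the index of the
 * first entry that the search string starts with, or NO_MATCH.
 */
static int
sketch_prefix_match(const SketchTable *table, const char *search)
{
    for (int i = 0; i < ENTRIES_PER_TABLE && table->strings[i]; i++) {
        size_t length = strlen(table->strings[i]);
        if (strncmp(search, table->strings[i], length) == 0) {
            return i;
        }
    }
    return NO_MATCH;
}

/*
 * Walks an array of contiguous tables. On a match, the resolved enum value
 * is the match index plus the offset of the table it was found in. If every
 * table has been tried without success, the default type is returned.
 */
static int
sketch_resolve_type(const SketchTable *tables, int table_count,
                    const char *search, int default_type)
{
    int offset = 0;

    for (int attempt = 0; attempt < table_count; attempt++) {
        int index = sketch_prefix_match(&tables[attempt], search);
        if (index != NO_MATCH) {
            return index + offset;
        }
        offset += ENTRIES_PER_TABLE;
    }
    return default_type;
}
```

The returned integer can be cast straight to the enum and fed to a switch statement, provided the enum values are laid out in the same order as the strings in each table.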

The other nice side-effect is that it forces you to pick which table a given
string should go in. I made this decision by looking at which types occurred
most frequently, and simply put those in the first table. Less frequent types
go in subsequent tables.

I have a hunch there's a lot of mileage in that approach; that is, linear
scanning an array of string tables until a match is found. There will be an
inflection point where some form of a log(n) binary tree search will perform
better overall, but it would be very interesting to see how many strings you
need to potentially match against before that point is hit.

Unless the likelihood of matching any given string in your set is completely
random, by ordering the strings in your tables by how frequently they occur,
the amortized cost of parsing a chunk of text would be very competitive using
this approach, I would think.

A fun experiment for next time, perhaps!

Appendix

And now here's all the stuff that wasn't important enough to occur earlier in
the article.

Implementation Considerations

One issue with writing so many versions of the exact same function is... how do
you actually handle this? Downstream consumers of the component don't need to
access the 30 different function pointers for each function you've experimented
with, but things like unit tests and benchmark programs do.

Here's what I did for the StringTable component. Define two API structures, a
normal one and an "extended" one. The extended one mirrors the normal one, and
then adds all of its additional functions to the end.

I use a .def file to control the DLL function exports, with an alias to easily
control which version of a function is the official version. The main header
file then contains some bootstrap glue (in the form of an inline function) that
dynamically loads the target library and resolves the number of API methods
according to the size of the API structure provided.
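A minimal sketch of the normal versus extended API idea follows. Everything in it is hypothetical (the Sketch* names, the four toy functions, and the static table that stands in for resolving exports from a DLL); it only illustrates how the size of the caller's structure can determine how many function pointers get filled in. It assumes the structures contain nothing but function pointers of one type, with no padding between them.

```c
#include <stddef.h>
#include <string.h>

typedef int (*SketchFunction)(int);

/* Normal API: only the official entry points. */
typedef struct {
    SketchFunction create;
    SketchFunction lookup;
} SketchApi;

/* Extended API: same leading layout, with experimental versions appended. */
typedef struct {
    SketchFunction create;
    SketchFunction lookup;
    SketchFunction lookup_v2;
    SketchFunction lookup_v3;
} SketchApiEx;

/* Toy functions standing in for the module's exported routines. */
static int sketch_create(int x)    { return x; }
static int sketch_lookup(int x)    { return x + 1; }
static int sketch_lookup_v2(int x) { return x + 2; }
static int sketch_lookup_v3(int x) { return x + 3; }

/* Stand-in for the exports, listed in the same order as the structures. */
static const SketchFunction sketch_exports[] = {
    sketch_create, sketch_lookup, sketch_lookup_v2, sketch_lookup_v3,
};

/*
 * Fills in as many function pointers as the caller's structure has room for.
 * Returns the number of functions resolved, or 0 if too many were requested.
 */
static size_t
sketch_load_api(void *api, size_t size_in_bytes)
{
    size_t wanted = size_in_bytes / sizeof(SketchFunction);
    size_t available = sizeof(sketch_exports) / sizeof(sketch_exports[0]);

    if (wanted > available) {
        return 0;
    }
    memcpy(api, sketch_exports, wanted * sizeof(SketchFunction));
    return wanted;
}
```

A downstream consumer would pass a SketchApi and its size and receive two functions, while a benchmark or unit test would pass a SketchApiEx and receive all four from the same loader.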

This currently means that the StringTable2.dll includes all 14 C and 5 assembly
variants, which is harmless, but it does increase the size of the module
unnecessarily. (The module is currently about 19KB in size, whereas it would be
under 4KB if only the official versions were included.) What I'll probably end
up doing is setting up a second project called StringTableEx, and, in
conjunction with some #ifdefs, have that be the version of the module that
contains all the additional functions, with the normal version just containing
the official versions.

Release Build versus Profile Guided Optimization Build

It's interesting to see a side-by-side comparison of the optimized release build
next to the PGO build. The main changes are mostly all to do with branching and
jump direction.

Typedefs

If there's one thing you can't argue about with the Pascal-style Cutler Normal
Form, it's that it loves a good typedef. For the sake of completeness, here's a
list of all the explicit or implied typedefs featured in the code on this page.