
2017-04-17 - finding invalid characters

some time ago, in a project i work on, i needed to parse, very often, some really big text files (a few GBs each). the syntax was simple, but it still needed to be checked, to ensure that the format was valid. long story short – the majority of the work boiled down to being able to quickly determine if all the characters in a sequence are from a given alphabet, known at compile-time. in other words, the task is to write a function with the signature:

bool isValid(std::string const& input);

returning true if all characters are from the given alphabet, and false otherwise. i've played around and came up with a neat solution, that resulted in much faster code than the original approach. let's have a look at how it evolved…

function's main loop

note that this function can return false right away, once an invalid character is found, so it can look like this:

for(auto c: input)
  if( not isValid(c) )
    return false;
return true;

easy, right? yes – easy… and slow.

branch-less check

have a look at the loop. we'll talk about isValid() in a second - for now let's assume that isValid() takes no time at all. where's the potential bottleneck? there is an if statement inside the tight loop! moreover, we do not honestly expect it to fail – i.e. normally isValid() will always return true! it's just a rare edge case where it might return false (read error or a corrupted file from god knows where). so actually we can rewrite the loop like this:
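a minimal sketch of such a branch-less loop, assuming that the per-character results are simply accumulated instead of returning early. the digit-only isValid(char) is just a placeholder made up for this example, so that the snippet compiles on its own:

```cpp
#include <cstddef>
#include <string>

// placeholder per-character check (an assumption of this sketch only):
// here a character is valid if it is a decimal digit.
inline bool isValid(char c) { return '0' <= c && c <= '9'; }

// branch-less main loop: no early return, results are accumulated instead.
bool isValid(std::string const& input)
{
  std::size_t errors = 0;
  for(auto c: input)
    errors += not isValid(c);   // bool converts to 0 or 1 - no if needed
  return errors == 0;
}
```

the price is that the whole input is always scanned, even if the very first character is invalid. since invalid input is expected to be rare, that should be an acceptable trade.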

now the only branch left is the for() loop itself. not much more can be removed now.

isValid()

so let's have a look at isValid() itself. there are different approaches, that can be taken here.

it's always good to know if there is some data-specific algorithm that can be used. for instance, if we only needed to accept characters with odd codes, our function could look like this:

bool isValid(char c) { return c & 0x01; }

unfortunately, in my case the set was just randomly spread through the ASCII table - just ~10 different letters, but no useful pattern there. one must just assume the “general case”. let's try out different approaches.

for the approaches below, i assumed a pseudo-random sequence of ASCII characters (seeded with a known seed) of size 1GB. the sequence was checked and the number of errors was reported. the letters considered “valid” were the ones that form the “Just Checking” string.
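a rough sketch of how such a test setup could be put together. all the names here (makeInput, countErrors) are hypothetical, the std::string::find based isValid() is only a slow placeholder, and i assume that the space from “Just Checking” is not a part of the alphabet:

```cpp
#include <cstddef>
#include <random>
#include <string>

// hypothetical helper: pseudo-random ASCII sequence, reproducible thanks to a fixed seed.
std::string makeInput(std::size_t size, std::mt19937::result_type seed)
{
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> dist(0, 127);  // int, as char is not an allowed type here
  std::string out;
  out.reserve(size);
  for(std::size_t i = 0; i < size; ++i)
    out.push_back(static_cast<char>(dist(gen)));
  return out;
}

// placeholder check: letters of "Just Checking" (assumption: space excluded).
bool isValid(char c)
{
  static std::string const alphabet{"JustChecking"};
  return alphabet.find(c) != std::string::npos;
}

// counts invalid characters, so that the result can be reported after the run.
std::size_t countErrors(std::string const& input)
{
  std::size_t errors = 0;
  for(auto c: input)
    errors += not isValid(c);
  return errors;
}
```

reporting the error count also keeps the compiler from optimizing the whole check away. swapping the body of isValid() is then enough to compare different approaches on the very same input.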

switch/case

probably the first thing that comes to one's mind. something along these lines:
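a minimal sketch of a switch/case based check, assuming the alphabet is the set of letters of “Just Checking”, without the space (that exclusion is my assumption):

```cpp
// switch/case based check for the letters of "Just Checking".
// assumption of this sketch: the space is not a part of the alphabet.
bool isValid(char c)
{
  switch(c)
  {
    case 'J': case 'u': case 's': case 't':
    case 'C': case 'h': case 'e': case 'c':
    case 'k': case 'i': case 'n': case 'g':
      return true;
    default:
      return false;
  }
}
```

what the compiler makes of it (jump table, bit mask, chain of comparisons) is entirely up to the compiler and optimization level, so results may differ a lot between toolchains.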

we can see big improvements on clang! gcc actually was a bit slower than the original if/else approach.

note however that even though we initialize the LUT once, we need to check if it is initialized each and every time the function is called! moreover this must be done in a thread-safe manner, so maybe…
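to make the problem concrete, here is a sketch of what a function-local static LUT could look like. the alphabet (letters of “Just Checking”, no space) and the layout of the table are assumptions of this example:

```cpp
#include <array>

bool isValid(char c)
{
  // built on the first call only. since C++11 this initialization is guaranteed
  // to be thread-safe, so the compiler typically emits a guard that is checked
  // on every single call.
  static auto const lut = []() -> std::array<bool, 256> {
    std::array<bool, 256> out{};                  // all entries start as false
    char const alphabet[] = "JustChecking";       // assumption: space excluded
    for(auto p = alphabet; *p != '\0'; ++p)
      out[static_cast<unsigned char>(*p)] = true;
    return out;
  }();
  // cast is needed, as plain char may be signed
  return lut[static_cast<unsigned char>(c)];
}
```

the lookup itself is a single indexed read. it is the hidden "is it initialized yet?" check around it that costs extra in a tight loop.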

static, global LUT

just put it as a global symbol and mark it const - it's already initialized by the time main() is executed and that's it!
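a sketch of how this could look. makeLut() and g_lut are names made up for this example, and the alphabet is again assumed to be the letters of “Just Checking” without the space:

```cpp
#include <array>

namespace
{
// hypothetical helper, that builds the table once, during program startup.
std::array<bool, 256> makeLut()
{
  std::array<bool, 256> out{};                  // all entries start as false
  char const alphabet[] = "JustChecking";       // assumption: space excluded
  for(auto p = alphabet; *p != '\0'; ++p)
    out[static_cast<unsigned char>(*p)] = true;
  return out;
}

// global and const: no per-call initialization check inside isValid().
auto const g_lut = makeLut();
}

bool isValid(char c)
{
  // cast is needed, as plain char may be signed
  return g_lut[static_cast<unsigned char>(c)];
}
```

one caveat: if isValid() were called from a constructor of another global object, the usual initialization-order issues would apply. with C++17 or newer the table could likely be made constexpr instead, which would sidestep that.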