9/27/2011

09-27-11 - String Match Stress Test

Take a decent-sized file like "book1", and do :

copy /b book1 + book1 twobooks

then test on "twobooks".

There are three general classes of how string matchers respond to a case like "twobooks" :

1. No problemo. Time per byte is roughly constant no matter what you throw at it (for both greedy and
non-greedy parsing).
This class is basically only made up of matchers that have a correct "follows" implementation.

2. Okay with greedy parsing. This class craps out in some kind of O(N^2) way if you ask them to match at
every position, but if you let them do greedy matching they are okay. This class does not have a correct
"follows" implementation, but does otherwise avoid O(N^2) behavior. For example MMC seems to fall into this
class, as does a suffix tree without "follows".

Any matcher with a small constant number of maximum compares
can fall into this performance class, but at the cost of an unknown amount of match quality.

3. Craps out even with greedy parsing. This class fails to avoid the O(N^2) trap that happens when you have a long
match and also many ways to make it. For example simple hash chains without an "amortize" limit fall in this
class. (with non-greedy parsing they are O(N^3) on degenerate cases like a file that's all the same char).

Two other interesting stress tests I'm using are :

Inspired by ryg, "stress_suffix_forward" :

4k of aaaaa...
then paper1
then 64k of aaaa...

obviously when you first reach the second part of "aaaa..." you need to find the beginning of the file,
but a naive suffix sort will have to look through 64k of following a's before it finds it.
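A minimal sketch of how that file could be generated, in Python. The function name and the default sizes as parameters are my own choices for illustration; "paper1" is passed in as bytes, so any stand-in file works if you don't have the corpus handy.

```python
def make_stress_suffix_forward(paper1: bytes,
                               short_run: int = 4 * 1024,
                               long_run: int = 64 * 1024) -> bytes:
    """Build: 4k of 'a', then paper1, then 64k of 'a'.

    The name and parameters are hypothetical; only the layout comes
    from the description above.
    """
    return b"a" * short_run + paper1 + b"a" * long_run


def write_stress_suffix_forward(paper1_path: str, out_path: str) -> None:
    # Read the middle section from disk and write the assembled test file.
    with open(paper1_path, "rb") as f:
        paper1 = f.read()
    with open(out_path, "wb") as f:
        f.write(make_stress_suffix_forward(paper1))
```

Usage would be something like write_stress_suffix_forward("paper1", "stress_suffix_forward").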

Another useful one to check on the "amortize" behavior is "stress_search_limit" :

book1
then, 1000 times :
128 random bytes
the first 128 bytes of book1
book1 again

obviously when you encounter all of book1 for the second time, you should match the whole book1 at the
head of the file, but matchers which use some kind of simple search limit (eg. amortized hashing)
will see the 128 byte matches first and may never get back to the really long one.
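A sketch of a generator for this one, again in Python. I'm reading the layout as: book1, then 1000 repetitions of (128 random bytes + the first 128 bytes of book1), then book1 once more; that reading, the function name, and the seed parameter are assumptions on my part.

```python
import random


def make_stress_search_limit(book1: bytes,
                             repeats: int = 1000,
                             chunk: int = 128,
                             seed: int = 0) -> bytes:
    """book1, then `repeats` x (random chunk + head of book1), then book1.

    Hypothetical helper: the seed just makes the random filler
    reproducible between runs.
    """
    rng = random.Random(seed)
    head = book1[:chunk]
    parts = [book1]
    for _ in range(repeats):
        # Random filler so each decoy starts in a fresh context.
        parts.append(bytes(rng.getrandbits(8) for _ in range(chunk)))
        # Decoy: a short match that looks like the start of book1.
        parts.append(head)
    parts.append(book1)
    return b"".join(parts)
```

The decoys all share the same first 128 bytes as the real target, so a matcher that only visits a limited number of the most recent candidates sees the decoys before it ever reaches offset 0.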

4 comments:

I understand how the later tests are testing for O(N^2) traps and etc., but I don't understand how twobooks is relevant to that stuff--isn't it just testing whether you can find long matches at all? There shouldn't be anything O(N^2) about it, should there?

(Also, if you're making a compressor that operates on, say, independent 256KB blocks, it's kind of an irrelevant test.)

" I understand how the later tests are testing for O(N^2) traps and etc., but I don't understand how twobooks is relevant to that stuff ... "

Yeah, there is. (this is for non-greedy parsing only). That's why twobooks is interesting, it isolates one specific O(N^2) case that is rarely handled.

Any string matcher that does not have a "follows" implementation will come to the second occurrence of book1 and do something like this :

find the longest match, which is N characters, which takes N character compares in the strcmp

go to the next pos, find the longest match, that takes N-1 character compares

etc. N+(N-1)+(N-2) ... = O(N^2)

You can fail to see these traps if you think of the string compare as being O(1).
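To make the sum concrete, here is a small Python sketch that counts the byte compares on a doubled buffer. It assumes the best case for the matcher: at every position it jumps straight to the right candidate (N bytes back) and only pays for the strcmp. The function name is made up for illustration.

```python
def compares_without_follows(data: bytes) -> int:
    """Count matching byte compares over the second copy of `data`.

    Assumption: the candidate N bytes back is found for free, so the
    only cost counted is the compare loop itself.
    """
    buf = data + data
    n = len(data)
    total = 0
    for pos in range(n, 2 * n):
        cand = pos - n
        length = 0
        # Plain strcmp-style loop: walks the whole remaining match.
        while pos + length < len(buf) and buf[cand + length] == buf[pos + length]:
            length += 1
        total += length
    return total


if __name__ == "__main__":
    for n in (1000, 2000, 4000):
        # Doubling n roughly quadruples the count: N*(N+1)/2.
        print(n, compares_without_follows(bytes(i % 251 for i in range(n))))
```

Even with the candidate search free, the compare count is N*(N+1)/2, so the per-byte time grows linearly with the match length.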

"(Also, if you're making a compressor that operates on, say, independent 256KB blocks, it's kind of an irrelevant test.)"

You could take the first 128k of book1 and do the same test.

But really it applies to any long match scenario in non-greedy parsing.

Obviously most people (such as myself in the existing rrLZH code) limit their exposure to these kinds of traps by doing things like bailing out of optimal parsing when they see a match longer than 256 or whatever.

Of course a match of length K at position N always implies a match of K-1 at position N+1, so you could probably avoid this specific case pretty easily (and optimizing that is useful in general for long matches, not just in this specific case).

Yeah, true, "twobooks" is actually a very easy to fix degenerate case of this problem. You can patch a solution to it onto any underlying string matcher just by tracking {lastoffset,lastml} and checking them first.
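A rough Python sketch of that patch, wrapped around a deliberately dumb brute-force matcher so the compare counts are easy to see. All the names here are hypothetical, and so is the "good_enough" threshold: I'm assuming that when the implied match (last length minus one, same offset) is already long, you accept it without searching.

```python
def brute_force(buf, pos, stats):
    """Stand-in underlying matcher: tries every earlier position."""
    best_off, best_len = 0, 0
    for cand in range(pos):
        n = 0
        while pos + n < len(buf) and buf[cand + n] == buf[pos + n]:
            n += 1
        stats["compares"] += n + 1  # matching bytes plus the stopping check
        if n > best_len:
            best_off, best_len = pos - cand, n
    return best_off, best_len


def match_every_position(buf, start, good_enough=32, use_last=True):
    """Find a match at every position from `start`, as a non-greedy parse needs.

    With use_last, a match of length K at pos-1 implies a match of K-1
    at pos with the same offset, so that is checked first and taken
    with no byte compares if it is at least `good_enough` long.
    """
    stats = {"compares": 0}
    last_off, last_len = 0, 0
    matches = []
    for pos in range(start, len(buf)):
        if use_last and last_len - 1 >= good_enough:
            off, length = last_off, last_len - 1
        else:
            off, length = brute_force(buf, pos, stats)
        matches.append((off, length))
        last_off, last_len = off, length
    return matches, stats["compares"]
```

On a doubled buffer this skips the search for almost the whole second copy. It only ever reuses the one previous offset, though, so it says nothing about cases where the best match at the next position lives somewhere else.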

But that doesn't really solve the underlying problem. It's sort of like patching on handling of the degenerate RLE case, that doesn't fix the ababab case, etc.

You can make a trickier "follows" stress test by repeating a long common prefix which is then followed by a variety of things, and by also including suffixes of the long prefix, such that the simple {lastoffset,lastml} no longer gives you the best match.