6/09/2016

06-09-16 | Fundamentals of Modern LZ : Two-Step Parse

For some reason I feel like writing a note on this today.

A two-step parse is an enhancement to a forward-arrivals parse.

(background : forward-arrivals parse stores the minimum cost from head at each position, along with
information on the path taken to get there. At each pos P, it takes the best incoming
arrival and considers all ways to go further into the parse (literal/match/rep/etc.).
At each destination point it stores arrival_cost[P] + step cost. In simple cases
(no carried state, no entropy coding, like LZSS) the forward-arrivals parse is a perfect
parse just like the backward dynamic-programming parse. In modern LZ with carried state such as
a rep set or markov state, the forward parse is preferable.)
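
To make the background concrete, here is a minimal sketch of a forward-arrivals parse for the simple case (no carried state, fixed step costs). The `matches` table, the cost numbers and the function name are made up for illustration; a real parser would get its candidates from a match finder and its costs from the entropy coder.

```python
INF = float("inf")

def forward_parse(n, matches, literal_cost=9):
    # matches[p] : list of (length, cost) match steps available at pos p
    #   (hypothetical input standing in for a match finder + cost model)
    cost = [INF] * (n + 1)    # cheapest known cost to arrive at each pos
    prev = [None] * (n + 1)   # how we got there : (source pos, label, length)
    cost[0] = 0
    for p in range(n):
        # take the best arrival at p and push every way of continuing
        steps = [(1, literal_cost, "lit")]
        steps += [(length, c, "match") for length, c in matches.get(p, [])]
        for length, step_cost, label in steps:
            q = p + length
            if q <= n and cost[p] + step_cost < cost[q]:
                cost[q] = cost[p] + step_cost
                prev[q] = (p, label, length)
    # walk the stored arrivals backward from the end to recover the path
    path = []
    q = n
    while q > 0:
        path.append(prev[q])
        q = prev[q][0]
    path.reverse()
    return cost[n], path

total, path = forward_parse(5, {1: [(4, 20)]})
print(total, path)
```

With one literal at 9 and one length 4 match at 20 available at pos 1, the parse picks literal + match for a total of 29 instead of five literals at 45.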

A two-step parse extends the standard forward-arrivals parse by being able to store an arrival from
a single coding step, or from two coding steps. The standard usage (as in LZMA/7zip) is to be able
to store a two-step arrival from the sequence { normal match, some literals, rep match }. This
multi-step arrival is stored with the cost of the whole sequence at the end point of the sequence.

If you stored *all* arrivals (not just the cheapest), you would not need two-step parse.
You could just store the first step, and then when your parse origin point advanced to the end of
the first step, it would find the second step and be able to choose it as an option.

But obviously you don't store all arrivals at each position, since the number would
explode, even with reduction by symmetries. (see, e.g., previous articles on A* parse)

The problem arises when you have a cheap multi-step sequence, but the first step is expensive.
Then the arrival from the first step might be replaced (or never filled in the first place) and
the parse will not be able to find the cheap second step.

Let's consider a concrete example for clarity.

Parser is at pos P, considering all ways to continue
At pos P there's a length 4 normal match available at offset O
It stores an arrival at [P+4] that's rather expensive
(because it has to send offset O).
At pos P+1 the parser finds a length 3 rep match
The exit from (P+1) length 3 also lands at [P+4]
This is a cheaper way to arrive at [P+4] , so the previous arrival from P via O is replaced
When the parser reaches P+4 it sees the incoming arrival as
being a rep match from P+1
But we missed something !
At pos P+5 (one step after the arrival) there are 2 bytes that match at offset O
if we had chosen the normal match to arrive at P+4 , we could now code a rep match
but we lost it, so we don't see the rep as an option.
Two-step to the rescue!
Back at pos P , we consider the one-step arrival :
{match len 4, offset O} to arrive at P+4
We also look after the end of that for cheap rep matches and find one.
So we store a two-step arrival :
{match len 4, offset O, 1 literal, rep len 2} to arrive at P+7
Now at pos P+1 the arrival at P+4 is stomped
but the arrival at P+7 remains! So we are able to find that in the future.
The options look like :
   P   P+4
   V   V
1. MMMMLRR
2. LRRRLLL
Option 2 is cheaper at P+4
but Option 1 is cheaper at P+7
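
The example above can be played out in a toy parser. This is only a sketch under loud assumptions: P is 0, O is 100, the rep offset carried into P is 7, the step costs (8 / 20 / 4 bits) are invented, and the `avail` table stands in for a real match finder. It keeps a single arrival per position, and the two-step option only tries exactly one literal between the match and the rep.

```python
LIT, MATCH, REP = 8, 20, 4   # hypothetical step costs in bits

def parse(n, avail, start_rep, two_step):
    # avail[(pos, offset)] : how many bytes at pos match at that offset
    #   (toy stand-in for a match finder)
    INF = float("inf")
    # one arrival per pos : (cost, rep offset after arriving, source pos, label)
    arrivals = [(INF, None, None, None)] * (n + 1)
    arrivals[0] = (0, start_rep, None, None)

    def store(q, cost, rep, src, label):
        # keep only the cheapest arrival ; this is where stomping happens
        if q <= n and cost < arrivals[q][0]:
            arrivals[q] = (cost, rep, src, label)

    for p in range(n):
        cost, rep, _, _ = arrivals[p]
        store(p + 1, cost + LIT, rep, p, "L")
        for (pos, off), length in avail.items():
            if pos != p:
                continue
            if off == rep:
                store(p + length, cost + REP, rep, p, "R" * length)
            else:
                store(p + length, cost + MATCH, off, p, "M" * length)
                if two_step:
                    # look one literal past the match end for a rep at the same offset
                    rep_len = avail.get((p + length + 1, off), 0)
                    if rep_len >= 2:
                        store(p + length + 1 + rep_len, cost + MATCH + LIT + REP,
                              off, p, "M" * length + "L" + "R" * rep_len)

    # trace the stored arrivals back from the end
    labels = []
    q = n
    while q > 0:
        _, _, src, label = arrivals[q]
        labels.append(label)
        q = src
    return arrivals[n][0], "".join(reversed(labels))

avail = {(0, 100): 4, (1, 7): 3, (5, 100): 2}
print(parse(7, avail, start_rep=7, two_step=False))
print(parse(7, avail, start_rep=7, two_step=True))
```

With these made-up costs the one-step parse ends on LRRRLLL at 36 bits, because the arrival at P+4 via O got stomped. The two-step parse keeps the MMMMLRR arrival at P+7 and ends at 32 bits.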

This is the primary application of two-step parse.

It's a (very limited) way of finding non-local minima in the parse search space.

The other option is "multi-parse", which stores multiple arrivals at each
position (something like 4 is typical). Multi-parse and two-step provide diminishing
returns when used together, so they usually aren't. Two-step is generally faster
and provides more win per CPU time; multi-parse is able to find longer-range non-local-minimum
moves and so provides more compression.
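
For comparison, here is a rough sketch of multi-parse on the same kind of toy input. Everything in it is an assumption for illustration: the costs, the `avail` table, and in particular the pruning rule (keep the `k` cheapest arrivals per position, at most one per distinct rep offset), which is just one plausible choice.

```python
LIT, MATCH, REP = 8, 20, 4   # hypothetical step costs in bits

def multi_parse(n, avail, start_rep, k):
    # arrivals[q] : up to k arrivals (cost, rep offset, path so far), cheapest first
    arrivals = [[] for _ in range(n + 1)]
    arrivals[0] = [(0, start_rep, "")]

    def store(q, cost, rep, path):
        if q > n:
            return
        slot = arrivals[q]
        for i, (c, r, _) in enumerate(slot):
            if r == rep:
                # same carried state : only the cheaper one is worth keeping
                if cost < c:
                    slot[i] = (cost, rep, path)
                    slot.sort(key=lambda a: a[0])
                return
        slot.append((cost, rep, path))
        slot.sort(key=lambda a: a[0])
        del slot[k:]   # drop anything beyond the k cheapest

    for p in range(n):
        # continue from every stored arrival, not just the cheapest
        for cost, rep, path in list(arrivals[p]):
            store(p + 1, cost + LIT, rep, path + "L")
            for (pos, off), length in avail.items():
                if pos != p:
                    continue
                if off == rep:
                    store(p + length, cost + REP, rep, path + "R" * length)
                else:
                    store(p + length, cost + MATCH, off, path + "M" * length)

    best = arrivals[n][0]
    return best[0], best[2]

avail = {(0, 100): 4, (1, 7): 3, (5, 100): 2}
print(multi_parse(7, avail, start_rep=7, k=1))
print(multi_parse(7, avail, start_rep=7, k=2))
```

With `k=1` this degenerates to the plain one-arrival parse (LRRRLLL, 36 bits). With `k=2` the expensive arrival at P+4 survives next to the cheap one, so the rep at P+5 is found the ordinary way (MMMMLRR, 32 bits), at the price of doing the per-position work once per stored arrival.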

All good modern LZ's need some kind of non-local-minimum parse, because to get into a good
state for the future (typically by getting the right offset into the rep offset cache)
you may need to make a more expensive initial step.