A virtue of the transform is that BWT(S) is likely to
be compressible by quite simple data-compression methods,
unless S is inherently incompressible.
For example, suppose that some word, w = w_0...w_{k-1},
occurs frequently in S.
Then all of the suffixes starting w_1...w_{k-1}...
will be sorted in a "block" and there will be a
corresponding "run" of w_0 in BWT(S).
Similarly for suffixes starting w_2...w_{k-1}...
and runs of w_1, and so on.
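
The effect can be seen on the running example. The following is a minimal Python sketch, assuming that BWT(S) lists the symbol preceding each suffix of S$ in sorted order, with $ sorting first. The names bwt and runs are illustrative only, and the naive suffix sort is chosen for clarity rather than efficiency.

```python
from itertools import groupby

def bwt(s):
    """BWT of s by naive suffix sorting; assumes '$' does not occur in s."""
    t = s + "$"
    # '$' sorts before the lower-case letters, so plain string order suffices.
    order = sorted(range(len(t)), key=lambda i: t[i:])
    # Emit the symbol preceding each sorted suffix; t[-1] is '$' for suffix 0.
    return "".join(t[i - 1] for i in order)

def runs(text):
    """Run-length encode text as a list of (symbol, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]

example = bwt("agcagcagact")
print(example)        # tgcc$ggaaaac
print(runs(example))
```

For S = agcagcagact all four occurrences of a end up in a single run, which is the kind of structure a simple run-length or move-to-front coder can exploit.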

Recovering S from BWT(S)

Surprisingly, it is possible to recover S from BWT(S).

The first symbol of BWT(S), here t, must be the last symbol of S because
the empty suffix ($) is the first suffix.

That t starts the suffix t$, at position 11 = rank(t)
in the sorted list.

It must be preceded in S by symbol 11 of BWT(S), c.

Suffix ct$ must be in
the range [rank(c), rank(g)) = [5, 8)
in the sorted list; in fact it must be at position 7,
because two other cs precede it in BWT(S).
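
Repeating these steps until $ is read yields S from right to left. The following is a minimal Python sketch of that loop, assuming 0-based positions and a unique $ terminator; the helper names (inverse_bwt, rank, occ_before) are illustrative and not taken from any particular implementation.

```python
def inverse_bwt(b):
    """Recover S from b = BWT(S), where S was terminated by a unique '$'."""
    # rank[c] = number of symbols in b that sort before c.
    rank, total = {}, 0
    for c in sorted(set(b)):
        rank[c] = total
        total += b.count(c)
    # occ_before[i] = number of copies of b[i] that occur in b[:i].
    seen, occ_before = {}, []
    for c in b:
        occ_before.append(seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    out, i = [], 0                     # position 0 holds the empty suffix '$'
    while b[i] != "$":
        c = b[i]                       # symbol preceding the current suffix
        out.append(c)
        i = rank[c] + occ_before[i]    # position of the suffix starting with this c
    return "".join(reversed(out))      # symbols were collected last to first

print(inverse_bwt("tgcc$ggaaaac"))     # agcagcagact
```

On the example the visited positions begin 0, 11, 7, matching the steps described above.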

This is equivalent to a search on an implicit trie (or similar)
of reversed prefixes.
Note that there is no actual explicit trie but
the BWT can be used as a substitute for one, and
a range, [lo, hi), created during search,
acts like a node in the implicit trie:

sorted reverse prefixes ~ trie

agcagcagact$               $
gcagcagact$a             $ a
gcagact$agca         $ agc a
gact$agcagca      $agc agc a
ct$agcagcaga     $agcagcag a
t$agcagcagac    $agcagcaga c
agcagact$agc          $ ag c
agact$agcagc       $agc ag c
cagcagact$ag            $ ag
cagact$agcag        $ agc ag
act$agcagcag     $agc agc ag
$agcagcagact    $agcagcagact

For example, gca occurs twice in agcagcagact.
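
The descent can be sketched in a few lines of Python. This is a minimal sketch, assuming 0-based positions and half-open ranges [lo, hi) over the sorted suffixes; backward_search and rank are illustrative names, and counting with str.count stands in for the precomputed occurrence tables that a practical index would use.

```python
def backward_search(b, pat):
    """Return the range [lo, hi) of sorted suffixes of S$ that start with pat.

    b is BWT(S); an empty result is reported as (0, 0).
    """
    # rank[c] = number of symbols in b that sort before c.
    rank, total = {}, 0
    for c in sorted(set(b)):
        rank[c] = total
        total += b.count(c)
    lo, hi = 0, len(b)                 # the root: every suffix matches ''
    for c in reversed(pat):            # scan the pattern right to left
        if c not in rank:
            return (0, 0)
        lo = rank[c] + b.count(c, 0, lo)
        hi = rank[c] + b.count(c, 0, hi)
        if lo >= hi:                   # fell off the implicit trie
            return (0, 0)
    return (lo, hi)

lo, hi = backward_search("tgcc$ggaaaac", "gca")
print(lo, hi, hi - lo)                 # 9 11 2
```

For gca the range narrows from [1, 5) for a, to [5, 7) for ca, to [9, 11) for gca, so the multiplicity is hi - lo = 2.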

Search

The locations of occurrences, if any, of a pattern, pat, in S
can be found by first finding the multiplicity of pat in S
and then locating the positions, in S, that correspond to
each BWT(S) position in the final range, [lo, hi).
In other words the suffix array,
the suffix# column above,
can be reconstructed when needed.
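
One simple way to do this, sketched below in Python, is to step backwards through S from each position in the range until $ is reached and count the steps. This is an unoptimised illustration under the same 0-based conventions as the example; locate and rank are hypothetical names, and a practical index would stop early at a stored sample rather than walk all the way to the start of S.

```python
def locate(b, lo, hi):
    """Positions in S of the suffixes at sorted positions lo..hi-1, given b = BWT(S)."""
    # rank[c] = number of symbols in b that sort before c.
    rank, total = {}, 0
    for c in sorted(set(b)):
        rank[c] = total
        total += b.count(c)
    positions = []
    for row in range(lo, hi):
        i, steps = row, 0
        while b[i] != "$":             # '$' precedes the suffix at position 0 of S
            c = b[i]
            i = rank[c] + b.count(c, 0, i)   # move to the suffix one symbol longer
            steps += 1
        positions.append(steps)        # steps taken = starting position in S
    return sorted(positions)

print(locate("tgcc$ggaaaac", 9, 11))   # [1, 4]
```

The range [9, 11) is the one found for gca, which indeed starts at positions 1 and 4 of agcagcagact.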

The BWA [LD09] implementation,
which is for DNA,
divides BWT(S) up into equal sized intervals.
These will not in general correspond to equal sized intervals
in S so the worst-case running time of BWA's search
may be much greater than the best case.
In contrast, the "FM-index" [FM00] gives better performance guarantees.
It divides S, rather than BWT(S),
up into equal sized intervals (buckets), and
one of these boundaries must be encountered in bounded time, but
this requires more sophisticated data structures.
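
The idea of sampling by position in S can be illustrated with a toy Python sketch. It is not the data structure of [FM00] or of BWA: it simply stores, in an ordinary dictionary, the position in S of every suffix whose position is a multiple of an assumed sampling interval k, so that a backwards walk meets a sample after at most k-1 steps. All names (build_index, locate_row, samples) are hypothetical.

```python
def build_index(s, k):
    """Return (b, rank, samples) for S = s, sampling every k-th position of S."""
    t = s + "$"
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # naive suffix array
    b = "".join(t[i - 1] for i in sa)                 # BWT(S)
    # rank[c] = number of symbols in b that sort before c.
    rank, total = {}, 0
    for c in sorted(set(b)):
        rank[c] = total
        total += b.count(c)
    # Keep only the sorted positions whose suffix starts at a multiple of k in S.
    samples = {row: pos for row, pos in enumerate(sa) if pos % k == 0}
    return b, rank, samples

def locate_row(b, rank, samples, row):
    """Position in S of the suffix at sorted position row."""
    steps = 0
    while row not in samples:          # at most k-1 iterations
        c = b[row]
        row = rank[c] + b.count(c, 0, row)   # step one symbol back in S
        steps += 1
    return samples[row] + steps

b, rank, samples = build_index("agcagcagact", 4)
print([locate_row(b, rank, samples, r) for r in range(9, 11)])   # [4, 1]
```

Because position 0 of S is always sampled, the walk never needs to step past $.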

Approximate Search

Exact search (above) is equivalent to descending a trie of
reversed prefixes as directed by a reverse scan of the search pattern.
The BW-transform of S acts as a compact substitute for the trie.
Just as the trie can be traversed for approximate search,
given an "allowance" of mutations between the pattern and
what is considered an approximate instance of it,
so approximate search can be performed [LD09]
using BWT(S) as a substitute for the trie.

Note that (approximately-) matching instances of the pattern in S
may overlap and even have common start and/or end positions.
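
A minimal Python sketch of such a search follows, assuming for simplicity that the only mutations allowed are substitutions (insertions and deletions are left out), and using the same 0-based ranges as the exact search. The names approx_search and descend are illustrative and are not those of [LD09].

```python
def approx_search(b, pat, allowance):
    """Ranges [lo, hi) of sorted suffixes that start with a string within
    `allowance` substitutions of pat, given b = BWT(S)."""
    # rank[c] = number of symbols in b that sort before c.
    rank, total = {}, 0
    for c in sorted(set(b)):
        rank[c] = total
        total += b.count(c)
    alphabet = sorted(set(b) - {"$"})
    results = set()

    def descend(i, lo, hi, left):
        if i < 0:                      # whole pattern consumed
            results.add((lo, hi))
            return
        for c in alphabet:             # try every branch of the implicit trie
            cost = 0 if c == pat[i] else 1
            if cost > left:
                continue               # allowance used up
            new_lo = rank[c] + b.count(c, 0, lo)
            new_hi = rank[c] + b.count(c, 0, hi)
            if new_lo < new_hi:
                descend(i - 1, new_lo, new_hi, left - cost)

    descend(len(pat) - 1, 0, len(b), allowance)
    return results

print(sorted(approx_search("tgcc$ggaaaac", "gcc", 1)))   # [(8, 9), (9, 11)]
```

With one substitution allowed, gcc matches gac (range [8, 9)) and the two occurrences of gca (range [9, 11)) in agcagcagact.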

A further refinement is that approximate search can be pruned.
The search above fails when another error is required but
the allowance has been used up.
It is possible [LD09] to estimate a useful lower bound on the
number of errors required, at a given position in the pattern,
to successfully complete an approximate search yielding
at least one approximate match.
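
One such bound can be sketched as follows: scan the pattern left to right, and each time the current segment stops being a substring of S, count one unavoidable error and start a new segment. The Python below is a minimal sketch of that idea; the name lower_bounds is hypothetical, and Python's `in` operator stands in for an index-based substring test.

```python
def lower_bounds(s, pat):
    """d[i] = a lower bound on the errors needed to match pat[0..i] somewhere in s."""
    d, errors, start = [], 0, 0
    for i in range(len(pat)):
        if pat[start:i + 1] not in s:  # this segment cannot occur without an error
            errors += 1
            start = i + 1              # begin a fresh segment after position i
        d.append(errors)
    return d

print(lower_bounds("agcagcagact", "gcc"))     # [0, 0, 1]
print(lower_bounds("agcagcagact", "ttgca"))   # [0, 1, 1, 1, 1]
```

Since the pattern is scanned right to left during search, a branch that still has pat[0..i] to match can be abandoned as soon as its remaining allowance is less than d[i].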