This page looks at a simple BWT compression package and suggests
modifications that make it useful as a compression pass before
encryption. I hope it will also give others ideas for improving
basic compression packages so that they not only compress better
but are also useful for encryption.

The problem with most compression methods is that they don't concern
themselves with making the reverse compression work for all files.
This leads to the unhappy situation where you compress your files
and then encrypt, or worse yet, as in PGP, it's done for you so that
you have little say in the matter. When an enemy tests, by whatever
means, some subset of the "keys" used in the encryption, it would be
nice if many such keys led to a false file, to delay the enemy from
finding the true message. As quantum computers become more of a working
reality (the NSA may already have them), it would be nice to use
compression-encryption combinations that do not have a unique solution,
so that a quantum computer cannot find it, since it seems this
particular type of computer will be best at problems where the
solution can settle down to some fixed state. At present there
are very few 1-1 compression schemes out in the open. My
huffman schemes and Matt's arithmetic coder are the only major
ones I know of. It would be nice to get one based on the BWT
so that much higher compression ratios could be achieved,
and because the BWT itself would prevent many forms of partial
chosen-plaintext attacks, at least like those that destroyed
Enigma. Since the data is scattered across several code blocks,
it would be unlikely, if implemented properly, that looking
at only a few blocks of encrypted text would lead an enemy to
the correct key. The problem with most ciphers is that if an
enemy knows part of a message, he may examine only the blocks
where that part of the message occurs. This is not possible if
one uses a BWT compression scheme in the correct way.

This discussion is based on Mark's BWTCODE.ZIP; the file and Mark's
page are here.
The article claims that the package beats PKZIP for the Calgary Corpus
file set, saying it gets 978,122 bytes. That may be from an old
package, as I got 978,129. Here are the results:

But the bad news is that this compression is not even close
to bijective, due to the excess baggage carried by the extra
data fields and the "?" EOD marker that is in the main data
field.
Since my concern is to use compression while eliminating waste,
or added info that is useful in breaking encryption, let's
examine how much waste this method adds to the compression.
The first problem is that the BWT is not fully reversible, even
if one ignores the pointer that comes out of it so one can
tell where the original data started. Example: ab transforms
to ba,1 and ba transforms to ba,0, but there is no reverse
transform for ab,0 or ab,1, which means that if one is trying
to do reverse transforms on random data, one would most likely
end up with something not reversible. This problem can be
handled and reduced; I will do so later. But the question
is how much effect this has on the compression-encryption
system.
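The two-character example can be checked by brute force. Below is a minimal Python sketch of the cyclic BWT (sort all rotations, keep the last column plus a row index; the index convention here may differ from the labeling in the example above, but the conclusion is the same): over the alphabet {a, b}, no input ever produces the last column "ab", so neither ab,0 nor ab,1 has a reverse transform.

```python
from itertools import product

def bwt(s):
    """Cyclic BWT: sort all rotations of s, emit the last column
    plus the row index of the original string."""
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rots), rots.index(s)

# Transform every 2-symbol string over {a, b}.
outputs = {s: bwt(s) for s in (''.join(p) for p in product('ab', repeat=2))}
print(outputs)
# Both 'ab' and 'ba' give the last column 'ba' (only the index differs),
# so the last column 'ab' is unreachable: ab,0 and ab,1 have no preimage.
print('ab' in {col for col, _ in outputs.values()})   # False
```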
This problem has to do with cycles: when one forms the T[]
vector during the UNBWT, you would generally like it to cycle through
each value without getting caught in a short cycle. This means that if
you look at a series of symbols and form the T[], what is the
likelihood that the data will happen to be in a single cycle?
The worst case occurs when all symbols are different. This puts
an upper limit on how often one would have to try an inverse
BWT before getting one that works. It is not possible to have
all symbols distinct for files as long as 2**16 bytes, but so I
can use the upper limit I will assume you can. This means that
1 in 2**16 tries would be successful, so you would lose about
16 bits of key strength in your encryption. My goal
for this page is to modify the added info in Mark's system
so that this is the only effect left, getting as close to that
as possible through the rest of the operations leading to a compressed file.
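A sketch of how a short cycle shows up in practice (my own toy code, not the package's): build the T[] vector by a stable sort of the last column and walk it; if the walk revisits a position before covering the whole block, no single-cycle inverse exists. Periodic inputs such as "aa", whose rotations repeat, need extra care and are ignored here.

```python
def unbwt(last, k):
    """Inverse cyclic BWT via the T[] vector.  Returns None when T[]
    traps the walk in a short cycle, i.e. no single-cycle inverse exists."""
    n = len(last)
    # A stable sort of positions by symbol yields the T[] successor vector.
    T = [i for _, i in sorted((c, i) for i, c in enumerate(last))]
    out, row, seen = [], k, set()
    for _ in range(n):
        row = T[row]
        if row in seen:          # short cycle: not a valid transform output
            return None
        seen.add(row)
        out.append(last[row])
    return ''.join(out)

print(unbwt('ba', 0))   # 'ab' -- a valid inverse exists
print(unbwt('ab', 0))   # None -- T[] falls apart into two short cycles
```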

Suppose one was using Skipjack, the 80-bit NSA method the government
wanted to cram down our throats. If you built a code package like
PGP but used Skipjack for the encryption phase and Mark's BWT
compression for the compression phase, you're in big trouble,
because the extra fields alone in the very first stage are enough
to make sure that only one possible decryption exists, namely
the one you wanted to hide. In the front of the file you get a
4-byte length field, which is a waste of space since you can
read the file and see how long it is. If you read the file you
can check exactly that it contains the correct 32 bits of info
(if the file is too big for your buffer, you still know what's
in the first length field, so it is never needed). Let's pretend
it's a short file of less than 65,000 bytes. In that case the
pointer to "last" has to point to a character in the BWT data
field, and that character has to be exactly "?", so that's another
32 + 8, for 72 bits specified. Also "first" has to have the
last 16 bits set to zero for this short file, so now we are at
88 bits. We still have not tested to see if the reverse BWT
exists, or if "first" and "last" are in lock step when
the T[] is calculated. But enough bits are specified that
with an 80-bit key encryption scheme there is already an
overwhelming chance that there can be only one solution:
the one you're trying to hide. And I haven't
even gotten into the other nonbijective aspects of this set
of compression processes; the RLE and the ARI routines are
both nonbijective.
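The bit counting above can be tallied directly. A small sketch (the field sizes are the ones just described; the survivor estimate assumes the checkable bits behave like independent random bits under a wrong key):

```python
# Rough count of bits an attacker can check "for free" in the format,
# for a file under 65,536 bytes (figures as described in the text above):
length_field = 32      # 4-byte length, fully predictable from the file itself
last_pointer = 32      # "last" must point at the single '?' EOD marker
eod_marker   = 8       # that character must literally be '?'
first_hi     = 16      # high 16 bits of "first" must be zero for a short file
checkable = length_field + last_pointer + eod_marker + first_hi
print(checkable)       # 88

# With an 80-bit cipher such as Skipjack, a wrong key survives all these
# checks with probability 2**-88, so the expected number of surviving
# false keys out of the 2**80 wrong ones is:
survivors = 2**80 / 2**88
print(survivors)       # 0.00390625 -- essentially always a unique solution
```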

The first improvement one can make is to get rid of all the
extra fields one does not need. The EOD "?" that Mark added
can be dropped by doing a full BWT and not adding an extra
character. You only need one pointer field; you can use first,
middle, or last, your choice. I chose last. You can drop the length
field, since even if you use small blocks you still should
know where everything is. One can also try to use a single block,
which is what I did in BWTQ. Then there is the nasty problem
of adding the pointer field in an optimal way such that no
information is provided to an attacker; see
DSC
The source code of BWTQ and DSC, along with test programs,
source code, and PC executables, is here
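To illustrate that the EOD marker and length field are unnecessary: a full cyclic BWT needs nothing but the transformed block and one pointer, and (for non-periodic blocks) the pair round-trips on its own. A minimal Python sketch, not the actual BWTQ code:

```python
def bwt(s):
    """Full cyclic BWT: no added EOD character, just last column + one pointer."""
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(r[-1] for r in rots), rots.index(s)

def unbwt(last, k):
    """Invert using the T[] vector built by a stable sort of the last column."""
    T = [i for _, i in sorted((c, i) for i, c in enumerate(last))]
    out, row = [], k
    for _ in range(len(last)):
        row = T[row]
        out.append(last[row])
    return ''.join(out)

data, ptr = bwt('bananas')
print(data, ptr)          # bnnsaaa 3
print(unbwt(data, ptr))   # bananas
```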

Note there are still big problems here, since Mark's RLE
is not 1-1. SEE MY UNADULTERATED RLE
The RLE he chose to use is very poor, since it adds info to
the compressed file that is not needed except by someone trying
to break the code.

At this point there are many choices for trying to make the rest of
the operations 1-1. I could use my drop-in replacement for RLE as mentioned
above. But I decided I did not like the fact that RLE is done twice,
and that the MTF maps to the "zero symbol"; why add new symbols
if they are not needed? So I created MTFQ, which just calls
the first symbol seen "ZERO" and the second one seen
"ONE" and so on. But other than that change of never introducing
a symbol not used, it is exactly like MTF. The other routine
modified here was the RLE. Every time I think about RLE I come up
with something different. In this case, for fun, I decided to make
a 1-1 RLE compressor that does its best not to add new symbols,
just like the MTFQ. Since it uses the first 3 symbols seen in the
file for the compression, I call it the A B C of RLEs:
RLEQ. The source code and PC executables are in the file
here
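One reading of the MTFQ description is ordinary move-to-front with a lazily built table: the k-th distinct symbol to appear codes as k and is then moved to the front like any other. Below is a hedged sketch of the encoder only (my reading of the description above, not the actual MTFQ source; a standalone decoder would also need to learn which actual byte each "new" code stands for, which the surrounding pipeline must supply):

```python
def mtfq_encode(data):
    """MTF with a lazily built table: the first symbol seen codes as 0,
    the second distinct symbol as 1, and so on -- no code is ever
    emitted that the data did not force into existence."""
    table, out = [], []
    for s in data:
        if s in table:
            i = table.index(s)
            out.append(i)
            table.pop(i)
        else:
            out.append(len(table))   # k-th new symbol codes as k
        table.insert(0, s)           # move (or insert) to front, as in MTF
    return out

print(mtfq_encode('banana'))   # [0, 1, 2, 1, 1, 1]
print(mtfq_encode('aab'))      # [0, 0, 1]
```

Note that a run of a single symbol always codes as all zeros, which is what makes the later RLE stage effective.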

Note there still are big problems here, since the arithmetic
coder is not the kind one uses when one wants to compress
before an encryption pass: it uses the old-fashioned
concept of an EOF that wastes space by being carried along
and then used only once at the end of the file.

In the next section I drop the old-style arithmetic
coder for my original 1-1 huffman compressor.

The problem with the above is that an enemy can still
gain too much knowledge about whether the supplied key
can be the actual key used. The added info is so large
that in many cases the file can only
be decompressed correctly if the correct "key" was used
for the decryption. This is not good in the coming world
when quantum computers will be more common. In the non-black
world, IBM already has working quantum computers.
Here is the result of replacing ARI with
h2com.exe

Notice that even though this is a better compression for files
that will be encrypted, it only beats the arithmetic coder
in making a smaller file for PAPER4.CG1. That the ARI compresses better
than a huffman coder is not a surprise, since if all other
things are equal a huffman coder is a subset of an arithmetic
coder. I could stop here, since my task of making a BWT
type of compressor for files that are to be encrypted
has ended, and it compresses smaller than the original
one suggested by Mark. However, it does not give
one much satisfaction that the ARI usually compresses
these Corpus files to a smaller size. What is one to do? SEE NEXT SECTION

Notice this seems to compress much better than ARI. I hoped
it would, since it is not carrying extra space for an unneeded
EOF character the way the old style is. So this is a better
compression set for files that are to be encrypted, and it
seems to be a better compressor than ARI all the way around.
ARI could be fixed to eliminate the wasteful EOF that it carries,
but for now one can use Matt's if one wants more secure
compression for files that are to be encrypted.

It is clear that if one uses a raw BWT, a DSC type of program
needs to be used after it. However, there are ways to make the required
pointer smaller so that the pointer value covers less range. It involves
minor changes to the BWT and UNBWT to allow more than a single cycle
of data. I may do that for the next project, since it seems easy to do,
and any forced reduction of N by one bit means that an attacker would get
twice as many passable false files when decrypting.

It is clear that there are many possibilities for the MTF and RLE
combinations, and that different types should be selected if one wishes
different versions of the programs for special files. It is not
possible to make one version compress all files, since every
real compressor increases the size of most files.

It is also clear that the last stage should be a compressor that
does not need a special EOF character, since the
EOF character not only wastes space but makes the job of an attacker much
easier.

One, it is possible to modify a typical BWT compressor such as
Mark's to make it closer to bijective. It is not bijective yet,
but it will not weaken the effective encryption key used by more
than N bits, where the number of bytes in the original file is 2**N.

This means that quantum computers will not find a unique solution,
which until now was not possible. If one was using the old BWT compression,
there would likely be only one solution, making it much easier to break.

This closeness to bijectiveness was achieved by the use of a
DSC routine, which
combines the pointer object with the BWT transform in an optimal way.
Then all the later stages are made 1-1, with the last stage being
a bijective arithmetic coder.

The file bibwt1.zip contains the PC
executables and batch files to do the last 3 types of compressions
and their inverses for those wishing to test it.