Unicode text stored as 16-bit code units (for example, UTF-16 little-endian text in the ASCII range) has a characteristic byte pattern, which can be seen in a hex editor as shown in Figure 1. A zero (non-printable) byte follows each printable character. Thus, the bytes at even offsets are printable characters, while the bytes at odd offsets are zeroes.

This pattern can be used to reorganize the text. Instead of leaving the bytes interleaved, the bytes at even offsets are written out first, followed by the bytes at odd offsets. All printable characters then precede all zero bytes, which results in the layout shown in Figure 2.
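The reordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `preprocess` and `restore` are hypothetical, and the sample assumes UTF-16 little-endian input.

```python
def preprocess(data: bytes) -> bytes:
    """Move even-offset bytes to the front and odd-offset bytes to the back."""
    return data[0::2] + data[1::2]


def restore(data: bytes) -> bytes:
    """Inverse of preprocess: re-interleave the two halves."""
    half = (len(data) + 1) // 2      # number of even-offset bytes
    out = bytearray(len(data))
    out[0::2] = data[:half]          # printable characters back to even offsets
    out[1::2] = data[half:]          # zero bytes back to odd offsets
    return bytes(out)


sample = "Hello".encode("utf-16-le")   # b'H\x00e\x00l\x00l\x00o\x00'
packed = preprocess(sample)            # b'Hello\x00\x00\x00\x00\x00'
assert restore(packed) == sample       # the transform is lossless
```

Because the transform only permutes bytes, the output has the same length as the input and can always be reversed after decompression.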

Compression algorithms benefit from this organization. LZ (Lempel-Ziv) algorithms, for instance, can find matches more easily among the contiguous printable characters, while the block of zeroes compresses well because RLE (Run-Length Encoding) can be applied to it effectively.
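One way to observe this effect is to compress the same bytes before and after reordering. The sketch below uses Python's standard `zlib` module (a Deflate implementation) as a stand-in; it is not one of the archivers used in the experiment, and the registry-like sample text and the helper names are assumptions made for illustration.

```python
import zlib


def preprocess(data: bytes) -> bytes:
    """Even-offset bytes first, odd-offset bytes last."""
    return data[0::2] + data[1::2]


def deflate_size(data: bytes) -> int:
    """Size in bytes after zlib compression at the highest level (9)."""
    return len(zlib.compress(data, 9))


# Hypothetical registry-like text, encoded as UTF-16 little-endian.
text = "".join(f'"Value{i}"=dword:{i:08x}\r\n' for i in range(2000))
original = text.encode("utf-16-le")
processed = preprocess(original)

print("original :", deflate_size(original), "bytes")
print("processed:", deflate_size(processed), "bytes")
```

The actual sizes depend on the input and on the compressor, so the sketch prints them rather than assuming a particular outcome.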

Experimental Analysis

In this experiment, the efficiency of the pre-processor is tested on real-world and random data. The real-world sample is a Windows registry (.REG) file, while the random sample consists of random printable characters. Both files have the same size (82,097,412 bytes). Six compression methods are applied to the original and pre-processed samples. Each method is manually set to its maximum compression level.
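A random sample of this kind could be generated as follows. The source does not say how the random file was produced, so the alphabet, the fixed seed, the UTF-16 little-endian encoding, and the function name `random_printable_utf16` are all assumptions of this sketch.

```python
import random
import string


def random_printable_utf16(n_chars: int, seed: int = 0) -> bytes:
    """Return n_chars random printable ASCII characters as UTF-16LE bytes."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    text = "".join(rng.choice(alphabet) for _ in range(n_chars))
    return text.encode("utf-16-le")  # two bytes per character, high byte zero


sample = random_printable_utf16(1000)
print(len(sample), "bytes")
```

Each character occupies two bytes under this encoding, so the zero bytes still fall at odd offsets even though the printable bytes carry no repeating structure.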

Real-world Data Results

Pre-processing improves compression for every method, with gains ranging from 3.72 percent to 43.87 percent as shown in Table 1. RAR benefits the most (43.87 percent), followed closely by ZPAQ (42.43 percent), which also produces the smallest pre-processed output (694,266 bytes). ZIP:Deflate improves by 20.36 percent. The 7z:LZMA algorithm is the most effective for the original data and still achieves a good result for the pre-processed data.

Table 1: Real-world Data Results (sizes in bytes)

Data           Uncompressed  ZIP:Deflate  BZip2      7z:PPMd    7z:LZMA    RAR        ZPAQ
Original       82,097,412    2,923,818    1,926,411  1,607,819  1,116,793  2,080,856  1,205,937
Pre-processed  82,097,412    2,328,555    1,854,655  1,469,189  1,007,911  1,167,924  694,266
Gain           -             20.36%       3.72%      8.62%      9.75%      43.87%     42.43%
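The gain values in Table 1 are consistent with the relative size reduction, (original - pre-processed) / original. The source does not state the formula, so this is an inference, but the short check below reproduces the published figures to two decimal places. The function name `gain_percent` is hypothetical.

```python
def gain_percent(original_size: int, processed_size: int) -> float:
    """Relative size reduction of the compressed output, in percent."""
    return (original_size - processed_size) / original_size * 100


# Compressed sizes in bytes, copied from Table 1: (original, pre-processed).
table1 = {
    "ZIP:Deflate": (2_923_818, 2_328_555),
    "BZip2":       (1_926_411, 1_854_655),
    "7z:PPMd":     (1_607_819, 1_469_189),
    "7z:LZMA":     (1_116_793, 1_007_911),
    "RAR":         (2_080_856, 1_167_924),
    "ZPAQ":        (1_205_937, 694_266),
}

for method, (orig, pre) in table1.items():
    print(f"{method:12s} {gain_percent(orig, pre):6.2f}%")
```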

Random Data Results

For the random data, Table 2 shows that pre-processing yields gains of 16.4 percent for ZIP:Deflate and 11.53 percent for RAR, while the improvements for BZip2 (0.01 percent) and ZPAQ (0.06 percent) are negligible.

Table 2: Random Data Results (sizes in bytes)

Data           Uncompressed  ZIP:Deflate  BZip2       7z:PPMd     7z:LZMA     RAR         ZPAQ
Original       82,097,412    40,812,580   33,951,889  34,784,486  35,666,811  38,877,871  33,679,718
Pre-processed  82,097,412    34,102,961   33,947,916  34,593,752  34,268,387  34,394,031  33,661,096
Gain           -             16.4%        0.01%       0.55%       3.92%       11.53%      0.06%

Conclusion

The experiment shows that pre-processing printable Unicode text organizes the data better and improves the compression ratio for all six methods tested, although the size of the gain varies widely, from 0.01 percent to 43.87 percent, depending on the data and the compression method.