Introduction

Hello, this is my first article on CodeProject. I have been a long-time
reader, and CodeProject has been an endless supply of answers to
many questions. After searching CodeProject, I found that the .NET section
lacked any articles on compression, so I thought I would write this one.

SharpZipLib from ICSharpCode

First of all, this article depends on SharpZipLib, which is free to
use in any sort of project. Details on the license and download links are
available here.

Purpose

A friend asked me to teach him C#, and as a teaching project I
decided to start writing a revision control system with both a server and a
client, since we've both had our share of pitfalls with CVS. One of the features he
wanted involved compression, so I sought out this library, but its documentation
is sketchy unless you use it purely as an API reference. Also, the
documentation only shows examples of file-based compression. In our
project, however, we wanted the ability to work in memory (with custom diff-type
patches). I originally found this library through a forum post that said this wasn't
possible, but after digging into the library documentation, I found some
Stream-oriented classes that looked promising. An hour or so of playing around,
and this simple and short code was the result. Since the code is relatively
short, I have not included any source or demo files to download. I hope someone
finds this useful!

Compression

For convenience, we import the System.IO, System.Text, and
SharpZipLib BZip2 namespaces:

using System;
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.BZip2;

First of all, we'll start with compression. Since we're using
MemoryStreams, let's create a new one:

MemoryStream msCompressed = new MemoryStream();

Simple enough, right? For this example, I will use BZip2. You can use Zip or
Tar; however, they require creating a dummy entry (a ZipEntry or TarEntry in
SharpZipLib), which is extra overhead that is not needed. My choice of BZip2 over
GZip comes from the experience that larger data compresses to a smaller size, at
the cost of a slightly larger header (discussed below).
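
A minimal sketch of this step, assuming the BZip2OutputStream class from the ICSharpCode.SharpZipLib.BZip2 namespace imported above. The class and method names (CompressionSetup, OpenCompressor) are hypothetical and only there to make the snippet self-contained:

```csharp
using System.IO;
using ICSharpCode.SharpZipLib.BZip2;

public static class CompressionSetup
{
    // Wraps the destination MemoryStream in a BZip2 compressor.
    // Anything written to the returned stream is compressed and
    // ends up in 'destination'.
    public static BZip2OutputStream OpenCompressor(MemoryStream destination)
    {
        return new BZip2OutputStream(destination);
    }
}
```

In the walkthrough this corresponds to wrapping msCompressed, for example: BZip2OutputStream zosCompressed = CompressionSetup.OpenCompressor(msCompressed);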

Pretty easy... Now, however, is a good time to address the header overhead I
mentioned above. In my practical tests, compressing a 1-byte string with GZip
produced 28 bytes of header overhead, plus the one byte that could not be
compressed any further. The same test with BZip2 produced 36 bytes of header
overhead. In practice, a 12892-byte source file from a test project compressed
to 2563 bytes using BZip2, a reduction of about 80%. Another test compressed
730 bytes to 429 bytes (about 41%), and a final test compressed 174 bytes to
161 bytes (about 7%).
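
For anyone who wants to check the arithmetic on their own data, here is a small helper. The names (CompressionStats, PercentSaved) are made up for this sketch:

```csharp
using System;

public static class CompressionStats
{
    // Percentage of the original size that compression removed.
    public static double PercentSaved(long originalBytes, long compressedBytes)
    {
        return 100.0 * (originalBytes - compressedBytes) / originalBytes;
    }
}
```

With the figures above, PercentSaved(12892, 2563) is about 80.1, PercentSaved(730, 429) is about 41.2, and PercentSaved(174, 161) is about 7.5.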

Obviously, with any compression, the more data is available, the better the
algorithm can compress patterns.

So with that little bit of theory out of the way, back to the code... From
here, we start writing data to the BZip2OutputStream:
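
A minimal sketch of the write step, assuming ASCII input text and the SharpZipLib BZip2OutputStream. The wrapper class and method (MemoryCompressor, Compress) and the variable names are illustrative assumptions, not necessarily what my test program used:

```csharp
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.BZip2;

public static class MemoryCompressor
{
    // Compresses ASCII text entirely in memory and returns the BZip2 bytes.
    public static byte[] Compress(string text)
    {
        // Streams work with bytes, so encode the string first.
        byte[] bytesBuffer = Encoding.ASCII.GetBytes(text);

        MemoryStream msCompressed = new MemoryStream();
        BZip2OutputStream zosCompressed = new BZip2OutputStream(msCompressed);

        // The compression stream compresses what it receives and
        // writes the result to the inner MemoryStream.
        zosCompressed.Write(bytesBuffer, 0, bytesBuffer.Length);

        // Closing finishes the BZip2 data (and closes the inner stream).
        zosCompressed.Close();

        // ToArray still works after the MemoryStream has been closed.
        return msCompressed.ToArray();
    }
}
```

Usage would look like: byte[] compressed = MemoryCompressor.Compress("This represents some data being compressed.");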

Pretty easy. As with most IO and stream methods, byte arrays are used instead
of strings. So we encode our output as a byte array, then write it to the
compression stream, which in turn compresses the data and writes it to the inner
stream, which is our MemoryStream.

So now the MemoryStream contains the compressed data, and we
pull it out as a byte array and convert it back to a string. Note that this
string is NOT readable; putting it into a textbox will render
strange results. If you want to view the data, the way I did it was to convert
it into a Base64 string, although this increases the size. Anyone with a better
suggestion is welcome to comment. Running this specific code turns the 43 bytes
of uncompressed data into 74 bytes of compressed data, which becomes 100
characters when encoded as a Base64 string.
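
A minimal sketch of the Base64 step, using only Convert.ToBase64String and Convert.FromBase64String from the framework. The class and method names (Base64View, ToDisplayString, FromDisplayString) are hypothetical:

```csharp
using System;

public static class Base64View
{
    // Turns arbitrary binary data into text that is safe to display or store.
    public static string ToDisplayString(byte[] compressed)
    {
        return Convert.ToBase64String(compressed);
    }

    // Recovers the exact original bytes from the Base64 text.
    public static byte[] FromDisplayString(string base64)
    {
        return Convert.FromBase64String(base64);
    }
}
```

Base64 emits 4 characters for every 3 bytes (padding the last group), which is why 74 bytes of compressed data come out as 100 characters.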

Obviously, these are not desirable results. However, the library compresses
short strings quickly enough that I believe this could be extended into a
method which tries compression and returns either the compressed or the
uncompressed data, with a flag indicating which was more efficient.
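
One possible shape for such a method, as a sketch only. It works on byte arrays rather than strings, prepends a one-byte flag, and takes the compression routine as a delegate so it is not tied to any particular library. All names here (SmartPacker, Pack, Unpack, the flag values) are assumptions made for illustration:

```csharp
using System;

public static class SmartPacker
{
    // Flag byte stored in front of the payload.
    private const byte Raw = 0;
    private const byte Compressed = 1;

    // Compresses 'data' and keeps the result only if it is actually smaller.
    public static byte[] Pack(byte[] data, Func<byte[], byte[]> compress)
    {
        byte[] candidate = compress(data);
        bool useCompressed = candidate.Length < data.Length;
        byte[] payload = useCompressed ? candidate : data;

        byte[] packed = new byte[payload.Length + 1];
        packed[0] = useCompressed ? Compressed : Raw;
        Buffer.BlockCopy(payload, 0, packed, 1, payload.Length);
        return packed;
    }

    // Reads the flag and decompresses only when the payload was compressed.
    public static byte[] Unpack(byte[] packed, Func<byte[], byte[]> decompress)
    {
        if (packed.Length == 0) throw new ArgumentException("Missing flag byte.");

        byte[] payload = new byte[packed.Length - 1];
        Buffer.BlockCopy(packed, 1, payload, 0, payload.Length);
        return packed[0] == Compressed ? decompress(payload) : payload;
    }
}
```

For the 43-byte example above, the 74-byte compressed form would lose, so Pack would store the raw bytes plus the one-byte flag.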

Uncompression

Now of course, to test our code above, we need some uncompression code. I
will put all the code together, since it's pretty much the same, just using a
BZip2InputStream instead of a BZip2OutputStream, and
Read instead of Write:
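
A minimal sketch of the reverse direction, assuming the SharpZipLib BZip2InputStream and ASCII text. It starts from the compressed byte array and reads in a loop until the stream reports no more data, so it does not need to know the uncompressed length in advance. The names (MemoryDecompressor, Decompress) are hypothetical:

```csharp
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.BZip2;

public static class MemoryDecompressor
{
    // Decompresses BZip2 data held in memory and returns it as ASCII text.
    public static string Decompress(byte[] compressed)
    {
        // A fresh MemoryStream over the byte array starts at position 0.
        MemoryStream msCompressed = new MemoryStream(compressed);
        BZip2InputStream zisUncompressed = new BZip2InputStream(msCompressed);

        MemoryStream msUncompressed = new MemoryStream();
        byte[] chunk = new byte[4096];
        int read;

        // Read returns the number of bytes produced; 0 means end of data.
        while ((read = zisUncompressed.Read(chunk, 0, chunk.Length)) > 0)
        {
            msUncompressed.Write(chunk, 0, read);
        }
        zisUncompressed.Close();

        return Encoding.ASCII.GetString(msUncompressed.ToArray());
    }
}
```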

Now, a quick check on sUncompressed should reveal the original
string intact... No files involved, however, if you wanted to load a file, there
are a few ways you can do it, and I leave it to your imagination.

Closing

Special thanks to the developers at ICSharpCode.Net for providing this
awesome library free to the public which makes this article possible. I have
no affiliation with ICSharpCode.Net, so I hope I have not breached anything in
posting this article.

I hope you all find this as useful as I have!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

I can only guess 1 of 2 things.
1) I assume you are using MemoryStream, but if you are not, and you are using a FileStream, then the security policy may well (by default) deny certain code domains access to the filesystem. This is a known "feature" of managed code, to prevent unauthorized access to files. The way to fix this should be to digitally sign your code and add the certificates to the client machines so they allow that program the access required. This is really only practical for final builds, since I believe the digital signature changes with each recompilation.
2) Since the code runs, and it's only an exception, dig for the inner exception and see what is causing it. I'm sure you've stepped through the code, but try again. Have you tried running the code without being run through an IE hosted control to make sure you have it working properly? I've heard a few people report they couldn't get it working, but I'm certain all of those are end-user errors, because it works fine for me and quite a few others. The only reason I can suggest that an IE Hosted Control might throw a security exception is from attempting to access the filesystem without proper security policy. I can't help much more without detailed exception information and perhaps a chunk of code to review causing the issue.

The way the security works in the IE hosted controls is a little bit odd.

As soon as a function which calls another assembly is called, the framework loads the assembly and performs a security check for the entire assembly, regardless of what class is being used.

Notice that this check happens before entering the function which calls the assembly, and not before the call to a function in the assembly. So, not even the first line of the outer function is called. Something similar happens when trying to use remoting.

In this particular case some portion of SharpZipLib uses file access functions, so the entire assembly fails to load in the IE hosted control.

The exception thrown is a plain, uninformative security exception without any message and without an inner exception.

Using the source code and making an assembly just with the compression and uncompression routines, works fine. But I don't know if that is legal.

Aha, nice catch Eric. It never even occurred to me that at a global level, they may reserve files, or open them strictly for the purpose of throwing the exception.
In my opinion, this is kind of strange: since the library works fine using MemoryStreams, there should not be a dependency on any files.
Searching a little more now, I see that there is more information on this subject popping up. See here for something by someone else:

http://weblogs.asp.net/cfranklin/archive/2003/12/13/43355.aspx

Again, his code relies on the SharpZipLib library.

As far as the terms of the SharpZipLib license go, here is something from their website:

"Linking this library statically or dynamically with other modules is making a combined work based on this library. Thus, the terms and conditions of the GNU General Public License cover the whole combination."

"As a special exception, the copyright holders of this library give you permission to link this library with independent modules to produce an executable, regardless of the license terms of these independent modules, and to copy and distribute the resulting executable under terms of your choice, provided that you also meet, for each linked independent module, the terms and conditions of the license of that module. An independent module is a module which is not derived from or based on this library. If you modify this library, you may extend this exception to your version of the library, but you are not obligated to do so. If you do not wish to do so, delete this exception statement from your version."

In simplest terms, this means you can link the library as-is in commercial projects. If you choose to strip out functionality from the library to support what you require, then your project must be licensed under the same terms as SharpZipLib itself, which is the GNU GPL.

[edit]
After reading the terms closer, I think you can modify their code, strip out what you need, and still use it in commercial projects. To be exact: "If you modify this library, you may extend this exception to your version of the library, but you are not obligated to do so."
In that case, you're well within your rights to strip out the troublesome file IO, cut back to only the compression/decompression routines you require, and probably gain a little speed by dropping the excess IO. You said it works, so great job. It sounds like a worthwhile effort to strip out and profile those routines and redistribute the library under the same license.
[edit]

If your work is a commercial project, in this specific case, I would contact the authors of SharpZipLib, explain the situation, and offer a slimmed-down, memory-only version that they may be willing to release as a DLL which can be linked in commercial projects. If it were me, I'd happily do it, because they have crippled their own library without realizing it, probably because they don't use remote IE hosted controls.

Kudos on finding the problem in their library. And thanks for also confirming the code otherwise works for you. A lot of people have experienced some strange problems that I haven't been able to replicate.

I have been thinking of writing a completely open-source, public domain compression library. If you are interested in getting involved, I would appreciate having someone with your experience on the matter. Feel free to email me.

Hello, I hope you got it working, but in case you haven't, I'd like to say that this code works as-is, if you copy/paste it. The code here is copied directly from a test program I was working on that still works right now. Of course, apply it to your own needs, but the code works fine. If you are having a problem with allocating new memory, I suggest encasing the problematic code into a try/catch block, and see what exception is being thrown.
If you still can't figure it out, I will write another test program and zip up the code for an example, and upload it. I have tested this on 2 machines, running the 1.1 framework and no problems... Catch your exception and post it, that would be much more helpful in tracking down the issue.

Some additional details would be helpful... Does the program stall before it fails on this line? Any warnings at all during the build? How big is the compressed/uncompressed data in question? Does compressing work for you? If you used another program to compress the data, what was it? (may need to contact the ziplib people if there is a bug in their library)

If it's running under a WinForms app, from the IDE, the app "hangs" and only Debug -> Stop Debugging helps, even Ctrl+Break doesn't.
If it's running under ASP.NET, the aspnet_wp process utilizes the CPU to its maximum.

Because I can't attach things in this forum, I uploaded my program to: http://www.pixiesoft.com/compression.zip so you can see it.
2 notes:
1) It's in VB.NET, but it's a good conversion from C#.
2) You need to update the reference to the SharpZipLib.

I'm sorry I have not had a chance to respond sooner. I also do not have Visual Studio .NET currently available to me, so your code is lost on me at the moment.

However, someone else did have a similar problem, and what they found they had to do was reset the MemoryStream back to the beginning, because they were reusing the same MemoryStream or something like that. Have you tried to catch any exceptions? It may be that when it hangs for you, it's actually producing an exception that nothing is catching. However, in both cases it sounds like SharpZipLib has gone into limbo trying to decompress a MemoryStream from an invalid offset.

Here is what I think is occurring.

1) You compress successfully to the MemoryStream.
2a) You do not pull the data out to a byte array
2b) You do not reset the MemoryStream
3) You attempt to uncompress and one of the following occurs:
4a) You attempt to uncompress the old MemoryStream using the stream, not the byte array
4b) You didn't reset the MemoryStream, so the read position is at the tail of the MemoryStream, past the compressed data
4c) You reset the MemoryStream, and attempted to uncompress into the same MemoryStream
5) In any of these cases the CPU goes into limbo. My guess is that streams work based on data being available and block synchronously until it is, so SharpZipLib has been unable to properly decompress the data, and it's related to the MemoryStream being used incorrectly.
6) If all else fails, trap everything, don't reuse MemoryStreams, create new objects for everything, expand your code to the fullest, and step through making sure data is intact at every point. A number of people have confirmed that this code does work if you follow it closely; once it works, expand it to your needs.
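
To illustrate the stream position issue from steps 2b and 4b, here is a small sketch using only MemoryStream (no compression involved). The class and method names are made up for the example. After writing, the position sits at the end of the data, so a read from the same stream returns 0 bytes unless Position is set back to 0 first:

```csharp
using System;
using System.IO;

public static class StreamRewind
{
    // Writes 'data' to a MemoryStream, then tries to read it back
    // from the same stream, optionally rewinding first.
    public static byte[] WriteThenReadBack(byte[] data, bool rewind)
    {
        MemoryStream ms = new MemoryStream();
        ms.Write(data, 0, data.Length);   // Position is now at the end

        if (rewind)
        {
            ms.Position = 0;              // back to the start of the data
        }

        byte[] buffer = new byte[data.Length];
        int read = ms.Read(buffer, 0, buffer.Length); // 0 when not rewound

        byte[] result = new byte[read];
        Buffer.BlockCopy(buffer, 0, result, 0, read);
        return result;
    }
}
```

Handing a decompressor a stream that is already positioned at its end means it sees no valid compressed data, which is why pulling the bytes out with ToArray and wrapping them in a new MemoryStream is the safer route.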

Be aware that another poster has informed me that a vanilla security exception is thrown when attempting to load the SharpZipLib library into an IE hosted control, as the library attempts to open files upon loading, before any calls to actual methods of the DLL. The gentleman has found a workaround that involves stripping the library down to minimal functionality; we can only hope he's kind enough to re-release the routines tailored for in-memory compression/decompression. Check the other posts here if you want to contact him.

Thanks for the detailed answer.
I'm afraid that none of your assumptions exist in my code.
I created a very minimal program that is intended to perform ONLY the task of compression and then decompression - it is the exact same code that you show here in your article, only translated to VB.NET .
Objects are created as noted - I don't reuse the MemoryStream or the SharpZipLib object.

I experienced the same problem. The cause isn't related to the compression or the instantiation of the compression objects. The problem is in rendering the byte array of compressed data into a string.

No matter how you alter the original string, you might note that the bytes of compressed data as rendered in the Immediate window are always

"BZh91AY&SYd"

The non-printing character at the end seemed strange to me, as it appears to trail the final quote mark. The entire string also seemed too short to me, especially after I doubled the original string from "This represents some data being compressed." to "This represents some data being compressed. This represents some data being compressed." This was confirmed when I checked the length of the sCompressed variable: 74 bytes.

So I took a look at the byte array in the Locals window and noticed that elements 14 and 15 are zero. I believe when the Encoding.ASCII.GetString(bytesBuffer) is rendering the strings it is treating the zero bytes as nulls in a null-terminated string. The remainder of the bytes are regarded as garbage.
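
A small check of what actually happens to the bytes can help here. The sketch below (hypothetical class and method names) pushes a byte array through Encoding.ASCII.GetString and back. As far as I know, a .NET string can hold a zero character, so a zero byte survives the round trip even if a debugger window stops displaying the text at that point; bytes above 0x7F, which compressed data is full of, have no ASCII representation and do not come back unchanged. Either way the conclusion is the same: keep compressed data in a byte array.

```csharp
using System;
using System.Text;

public static class AsciiRoundTrip
{
    // Converts bytes to a string with Encoding.ASCII and back again.
    public static byte[] ThroughAsciiString(byte[] data)
    {
        string text = Encoding.ASCII.GetString(data);
        return Encoding.ASCII.GetBytes(text);
    }

    // Byte-by-byte comparison of two arrays.
    public static bool SameBytes(byte[] a, byte[] b)
    {
        if (a.Length != b.Length) return false;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) return false;
        }
        return true;
    }
}
```

Comparing ThroughAsciiString(data) with data using SameBytes on a real compressed buffer shows whether the string conversion is damaging it.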

To confirm, I tried the compression and stopped after creating the byte array. Then I altered the decompression algorithm to accept the byte array as an argument and start from there. In short, I skipped the conversion to and from the string variable. It works fine every time.

You are absolutely, 100 percent correct. To me, it was apparent because my debugger informed me that the byte array contained the correct number of elements. The solution I thought of and recommended was encoding the compressed data using Base64. This ensures that the zero bytes in the array are encoded in a form that a string can carry correctly.

However, do realize that by using Base64 encoding, if you are not compressing a significant amount of data, you will probably lose even more of your compression, because Base64 represents every 3 bytes as 4 characters, growing the data by about a third. Granted, you think your string is generally "proper text", but once compressed, its bytes are arbitrary, and as TLang suggested here, 2 bytes in the early part of the data already needed encoding, if not more. The end result is that if compression has not saved more than about 25 percent of the space, Base64 encoding will cancel out the benefit of compressing to begin with.

I am not sure if any other more efficient means of encoding exist specifically for such a case. But if not, it would not be difficult to create your own encoding method that replaces all characters of say, ~, with ~~, and all other characters that require an escape to be ~something codes. Example from above would appear as:

"BZh91AY&SY~0d~0" .... And so on.

In reverse, you decode ~ based codes and then decompress the data. At the VERY least, you should only need to deal with bytes of the value 0, which indicate the NULL character in C, and therefore result at a low level (as you notice in the debugger) to terminate the string prematurely.
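
A rough sketch of that escaping idea, working on bytes. Only two codes are defined here, "~~" for a literal tilde and "~0" for a zero byte; the class name (TildeEscape) and the exact codes are assumptions for illustration, and any other byte that is unsafe in your string encoding would need its own code:

```csharp
using System;
using System.IO;

public static class TildeEscape
{
    private const byte Tilde = (byte)'~';
    private const byte ZeroCode = (byte)'0';

    // Replaces ~ with ~~ and the zero byte with ~0.
    public static byte[] Encode(byte[] data)
    {
        MemoryStream output = new MemoryStream();
        foreach (byte b in data)
        {
            if (b == Tilde) { output.WriteByte(Tilde); output.WriteByte(Tilde); }
            else if (b == 0) { output.WriteByte(Tilde); output.WriteByte(ZeroCode); }
            else { output.WriteByte(b); }
        }
        return output.ToArray();
    }

    // Reverses Encode; throws on a dangling or unknown escape code.
    public static byte[] Decode(byte[] escaped)
    {
        MemoryStream output = new MemoryStream();
        for (int i = 0; i < escaped.Length; i++)
        {
            byte b = escaped[i];
            if (b != Tilde) { output.WriteByte(b); continue; }

            if (i + 1 >= escaped.Length) throw new FormatException("Dangling escape character.");
            byte code = escaped[++i];
            if (code == Tilde) output.WriteByte(Tilde);
            else if (code == ZeroCode) output.WriteByte(0);
            else throw new FormatException("Unknown escape code.");
        }
        return output.ToArray();
    }
}
```

Each escaped byte costs one extra byte, so the overhead depends on how many zeros and tildes the compressed data happens to contain.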

Hope this helps some. Glad you found the article useful; you seem to be the first with positive feedback.

I have since dropped the project which involved the compression routines I was writing, so I will no longer be supporting this article, which is why I sent in a compiled program to attach to the article, which should clear up any problems people were having. I apologize, but if people cannot get it working with my demo project now, I suggest they give up. The demo project utilizes the Base64 encoding method.

Actually, I spent the time required to figure out what was actually wrong with the example project. The problem is:

The "cmdUncompress_Click" method uses the length of the compressed data to determine how many bytes should be uncompressed and extracted. Therefore, when the number of characters in the original data is more than the number of characters of the compressed data (which occurs with any non-trivial original data), the original data appears to be truncated when it is uncompressed.

The example code can be fixed in one of two ways: keep track of the length of the original data and use that length when uncompressing, or use a tight loop that uncompresses a few bytes at a time until the compressed stream is fully consumed.
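
For the first option, one way to keep the original length with the data is to store it in front of the compressed bytes. This is only a sketch using BinaryWriter and BinaryReader; the class and method names (LengthPrefixedFrame, Wrap, Unwrap) and the 4-byte prefix layout are assumptions, not something the example project does:

```csharp
using System;
using System.IO;

public static class LengthPrefixedFrame
{
    // Stores the uncompressed length (4 bytes) followed by the compressed bytes.
    public static byte[] Wrap(int originalLength, byte[] compressed)
    {
        MemoryStream ms = new MemoryStream();
        BinaryWriter writer = new BinaryWriter(ms);
        writer.Write(originalLength);
        writer.Write(compressed);   // raw bytes, no extra length prefix
        writer.Flush();
        return ms.ToArray();
    }

    // Splits a frame back into the original length and the compressed bytes.
    public static byte[] Unwrap(byte[] frame, out int originalLength)
    {
        BinaryReader reader = new BinaryReader(new MemoryStream(frame));
        originalLength = reader.ReadInt32();
        return reader.ReadBytes(frame.Length - 4);
    }
}
```

The decompression code can then allocate a buffer of originalLength bytes instead of guessing from the compressed size.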