MemoryStream Compression

Introduction

Hello, this is my first article on CodeProject. I have been a long-time
reader, and CodeProject has been an endless supply of answers to
many questions. After searching CodeProject, I found that the .NET section
lacked any articles on compression, so I thought I would write this one.

SharpZipLib from ICSharpCode

First of all, this article depends on SharpZipLib, which is free to
use in any kind of project. Details on the license and download links are
available here.

Purpose

A friend asked me to teach him C#, and as a teaching project I
decided to start writing a revision control system with both a server and a
client, since we've both had our share of pitfalls with CVS. One of the features
he wanted involved compression, so I sought out this library. Its documentation
is sketchy unless you use it purely as an API reference, and it only shows
examples of file-based compression. In our project, however, we wanted the
ability to work in memory (with custom diff-type patches). I originally found
this library through a forum post that said this wasn't possible, but after
digging into the library documentation, I found some Stream-oriented classes
that looked promising. After an hour or so of playing around, this short and
simple code was the result. Since the code is relatively short, I have not
included any source or demo files to download. I hope someone finds this
useful!

Compression

For convenience, we import the System, System.IO, System.Text, and
SharpZipLib BZip2 namespaces:

using System;
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.BZip2;

First of all, we'll start with compression. Since we're using
MemoryStreams, let's create a new one:

MemoryStream msCompressed = new MemoryStream();

Simple enough, right? For this example, I will use BZip2. You can use Zip or
Tar instead; however, they require creating a dummy entry (a ZipEntry or
TarEntry), which is extra overhead that is not needed here. I chose BZip2 over
GZip because, in my experience, it compresses larger data to a smaller size, at
the cost of a slightly larger header (discussed below).
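The next step is to wrap the MemoryStream in a BZip2OutputStream. The following is a minimal sketch of that step, assuming a reasonably recent SharpZipLib release that exposes the IsStreamOwner property. The class name WrapSketch and the variable name zosCompressed are illustrative.

```csharp
using System;
using System.IO;
using ICSharpCode.SharpZipLib.BZip2;

static class WrapSketch
{
    static void Main()
    {
        MemoryStream msCompressed = new MemoryStream();

        // Anything written to zosCompressed is compressed and forwarded to msCompressed.
        BZip2OutputStream zosCompressed = new BZip2OutputStream(msCompressed);

        // By default, closing the compression stream also closes the inner stream.
        // Turning ownership off keeps msCompressed usable afterwards (assumed to be
        // available in the SharpZipLib version you use).
        zosCompressed.IsStreamOwner = false;

        // Closing finishes the BZip2 stream even though no data was written.
        zosCompressed.Close();
        Console.WriteLine("Header and trailer alone: {0} bytes", msCompressed.Length);
    }
}
```

Leaving IsStreamOwner at its default also works, as long as you read the result with ToArray(), which MemoryStream still allows after it has been closed.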

Pretty easy... Now, however, is a good time to address the header overhead I
mentioned above. In my practical tests, compressing a 1-byte string with GZip
produced 28 bytes of header overhead, plus the one byte that could not be
compressed any further. The same test with BZip2 produced 36 bytes of header
overhead. In practice, BZip2 compressed a 12892-byte source file from a test
project to 2563 bytes, a reduction of about 80%. Another test compressed 730
bytes to 429 bytes (about 41%), and a final test compressed 174 bytes to 161
bytes (about 7%).

Obviously, with any compression algorithm, the more data there is, the more
repeated patterns it can find and exploit.
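If you want to repeat measurements like these on your own data, a sketch along the following lines may help. It assumes SharpZipLib's GZipOutputStream alongside BZip2OutputStream, and the helper names CompressedSize and ReductionPercent are hypothetical. The generated input is repetitive stand-in data, so the printed sizes will not match the figures quoted above.

```csharp
using System;
using System.IO;
using ICSharpCode.SharpZipLib.BZip2;
using ICSharpCode.SharpZipLib.GZip;

static class OverheadSketch
{
    // Percentage by which the data shrank, e.g. 12892 -> 2563 is roughly 80.
    public static double ReductionPercent(int original, int compressed)
    {
        return 100.0 * (original - compressed) / original;
    }

    // Compresses data through whichever stream the wrap delegate creates
    // and returns the number of bytes that reached the MemoryStream.
    public static int CompressedSize(byte[] data, Func<Stream, Stream> wrap)
    {
        MemoryStream ms = new MemoryStream();
        using (Stream z = wrap(ms))
        {
            z.Write(data, 0, data.Length);
        }
        return ms.ToArray().Length; // ToArray works even after ms was closed
    }

    static void Main()
    {
        foreach (int size in new[] { 1, 174, 730, 12892 })
        {
            byte[] data = new byte[size];
            for (int i = 0; i < size; i++) data[i] = (byte)('a' + i % 7);

            int bz = CompressedSize(data, s => new BZip2OutputStream(s));
            int gz = CompressedSize(data, s => new GZipOutputStream(s));
            Console.WriteLine("{0,6} bytes -> BZip2 {1,6} ({2:F1}%), GZip {3,6} ({4:F1}%)",
                size, bz, ReductionPercent(size, bz), gz, ReductionPercent(size, gz));
        }
    }
}
```

A negative percentage simply means the headers outweighed the savings, which is what you should expect for very small inputs.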

So with that little bit of theory out of the way, back to the code... From
here, we start writing data to the BZip2OutputStream:
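A minimal sketch of this step, assuming UTF-8 as the text encoding and a 43-character sample string. The class name CompressionSketch and the method name Compress are illustrative and not part of the library.

```csharp
using System;
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.BZip2;

static class CompressionSketch
{
    // Compresses a string entirely in memory and returns the compressed bytes.
    public static byte[] Compress(string sBuffer)
    {
        // Streams work on bytes, so encode the text first.
        byte[] bytesBuffer = Encoding.UTF8.GetBytes(sBuffer);

        MemoryStream msCompressed = new MemoryStream();
        using (BZip2OutputStream zosCompressed = new BZip2OutputStream(msCompressed))
        {
            zosCompressed.Write(bytesBuffer, 0, bytesBuffer.Length);
        } // disposing finishes the BZip2 stream and closes msCompressed

        return msCompressed.ToArray(); // still allowed on a closed MemoryStream
    }

    static void Main()
    {
        byte[] bytesCompressed = Compress("This represents some data being compressed.");
        Console.WriteLine("{0} bytes after compression", bytesCompressed.Length);
    }
}
```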

Pretty easy. As with most IO and stream methods, byte arrays are used instead
of strings. So we encode our output as a byte array, then write it to the
compression stream, which in turn compresses the data and writes it to the inner
stream, which is our MemoryStream.

So now the MemoryStream contains the compressed data, and we
pull it out as a byte array and convert it to a string. Note that this
string is NOT readable; putting it into a textbox will render strange results.
A text encoding such as ASCII also cannot represent every byte value, so keep
the byte array itself if you intend to decompress the data later. If you want
to view the data, the way I did it was to convert it into a Base64 string, but
this increases the size by about a third; anyone with suggestions is welcome to
comment. Running this specific code turns the 43 bytes of uncompressed data
into 74 bytes of compressed data, and when encoded as a Base64 string, the
final result is 100 characters.
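The Base64 step and its size arithmetic can be sketched with the standard Convert class alone. The 74-byte array below is a zero-filled stand-in for real compressed data, used only to show the length calculation: Base64 emits 4 characters for every 3 bytes, rounded up, so 74 bytes become 100 characters.

```csharp
using System;

static class Base64Sketch
{
    static void Main()
    {
        // Stand-in for 74 bytes of compressed data (contents do not matter here).
        byte[] bytesCompressed = new byte[74];

        // Base64 is safe to display or store as text, unlike the raw bytes.
        string sBase64 = Convert.ToBase64String(bytesCompressed);
        Console.WriteLine(sBase64.Length); // 100

        // Decoding gives back exactly the original bytes.
        byte[] bytesRoundTrip = Convert.FromBase64String(sBase64);
        Console.WriteLine(bytesRoundTrip.Length); // 74
    }
}
```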

Obviously, these are not desirable results. However, I believe the speed at
which the library compresses short strings means this could be extended into a
method that returns either the compressed or the uncompressed data, with a flag
indicating which was smaller.
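One way such a method might look is sketched below. The one-byte flag layout and the names Pack and Unpack are hypothetical choices made for illustration, and the sketch assumes Stream.CopyTo is available (.NET 4.0 or later).

```csharp
using System;
using System.IO;
using ICSharpCode.SharpZipLib.BZip2;

static class AdaptiveSketch
{
    // Hypothetical flag values stored in the first byte of the packed data.
    const byte Raw = 0, Compressed = 1;

    // Returns [flag][payload], where payload is whichever form is smaller.
    public static byte[] Pack(byte[] data)
    {
        MemoryStream ms = new MemoryStream();
        using (BZip2OutputStream zos = new BZip2OutputStream(ms))
        {
            zos.Write(data, 0, data.Length);
        }
        byte[] compressed = ms.ToArray();

        bool useCompressed = compressed.Length < data.Length;
        byte[] payload = useCompressed ? compressed : data;

        byte[] packed = new byte[payload.Length + 1];
        packed[0] = useCompressed ? Compressed : Raw;
        Buffer.BlockCopy(payload, 0, packed, 1, payload.Length);
        return packed;
    }

    // Reverses Pack by checking the flag byte first.
    public static byte[] Unpack(byte[] packed)
    {
        using (MemoryStream payload = new MemoryStream(packed, 1, packed.Length - 1))
        using (MemoryStream result = new MemoryStream())
        {
            if (packed[0] == Raw)
            {
                payload.CopyTo(result);
            }
            else
            {
                using (BZip2InputStream zis = new BZip2InputStream(payload))
                {
                    zis.CopyTo(result);
                }
            }
            return result.ToArray();
        }
    }
}
```

The cost is one extra byte per message, in exchange for never sending data that grew during compression.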

Decompression

Now of course, to test our code above, we need some decompression code. I
will put all the code together, since it's pretty much the same, just using a
BZip2InputStream instead of a BZip2OutputStream, and
Read instead of Write:
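A minimal round-trip sketch follows, assuming UTF-8 text and .NET 4.0 or later for Stream.CopyTo. CopyTo calls Read repeatedly until the stream ends, so the uncompressed size does not need to be known in advance. The class and method names are illustrative.

```csharp
using System;
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.BZip2;

static class RoundTripSketch
{
    public static byte[] Compress(string sBuffer)
    {
        byte[] bytesBuffer = Encoding.UTF8.GetBytes(sBuffer);
        MemoryStream msCompressed = new MemoryStream();
        using (BZip2OutputStream zosCompressed = new BZip2OutputStream(msCompressed))
        {
            zosCompressed.Write(bytesBuffer, 0, bytesBuffer.Length);
        }
        return msCompressed.ToArray();
    }

    public static string Decompress(byte[] bytesCompressed)
    {
        using (MemoryStream msCompressed = new MemoryStream(bytesCompressed))
        using (BZip2InputStream zisUncompressed = new BZip2InputStream(msCompressed))
        using (MemoryStream msUncompressed = new MemoryStream())
        {
            // Reads in chunks until the end of the compressed stream is reached.
            zisUncompressed.CopyTo(msUncompressed);
            return Encoding.UTF8.GetString(msUncompressed.ToArray());
        }
    }

    static void Main()
    {
        string sOriginal = "This represents some data being compressed.";
        string sUncompressed = Decompress(Compress(sOriginal));
        Console.WriteLine(sUncompressed == sOriginal);
    }
}
```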

Now, a quick check on sUncompressed should reveal the original
string intact, with no files involved. If you want to load a file instead, there
are a few ways you can do it, and I leave those to your imagination.

Closing

Special thanks to the developers at ICSharpCode.Net for providing this
awesome library free to the public which makes this article possible. I have
no affiliation with ICSharpCode.Net, so I hope I have not breached anything in
posting this article.

I hope you all find this as useful as I have!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

The Read() method is not implemented; it just throws a "not supported" exception. The workaround is as follows:
GZipOutputStream.Finish();
GZipOutputStream.Flush();
bytesBuffer = msUncompressed.GetBuffer();

It's been a while since I've looked up this article. I notice some good solutions came about to people's problems. As mentioned, I've abandoned this; however, those of you interested in the compression end of things can head over and check out my new post regarding a .NET port of minilzo.

Thanks to Mathieu, who correctly pointed out the right way to solve the problem of not knowing the original uncompressed size. As mentioned before, my protocol didn't include it at this level; however, it is a standard technique to include the size before the compressed data, as this comment suggests. Sorry for any confusion and for my obviously late response. I hope everyone found this useful.

I am getting an error at
BZip2InputStream zisUncompressed = new BZip2InputStream(msUncompressed);

Object reference not set to an instance of an object.
Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

Exception Details: System.NullReferenceException: Object reference not set to
an instance of an object.

Hi,
I am able to compress files or byte arrays and put them in a MemoryStream using SharpZipLib.Zip, ZipOutputStream, etc. I create a ZipEntry for each of these and put them in an in-memory zip. However, I am having a hard time extracting these files. I can uncompress the whole stream and write it to a FileStream, but that is not what I want. I need to be able to extract a given file from the MemoryStream. How do I do that?
With a disk-based Zip, I can create a ZipFile instance pointing to the Zip file name/path and use GetEntry, GetInputStream, etc., but when I try to create an instance of ZipFile using a ZipOutputStream, I get an error saying that the stream is not seekable.
Any help is appreciated!
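One possible approach to the question above, offered as a sketch rather than a confirmed fix: finish the ZipOutputStream, take the resulting bytes, and open a fresh, seekable MemoryStream over them with ZipFile. This assumes a SharpZipLib version whose ZipFile has a constructor taking a Stream. The helper names BuildZip and ExtractEntry are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.Zip;

static class InMemoryZipSketch
{
    // Builds a one-entry zip archive entirely in memory.
    public static byte[] BuildZip(string entryName, byte[] data)
    {
        MemoryStream ms = new MemoryStream();
        using (ZipOutputStream zos = new ZipOutputStream(ms))
        {
            zos.PutNextEntry(new ZipEntry(entryName));
            zos.Write(data, 0, data.Length);
            zos.CloseEntry();
        } // disposing writes the central directory
        return ms.ToArray();
    }

    // Extracts a single named entry, or returns null if it is not present.
    public static byte[] ExtractEntry(byte[] zipBytes, string entryName)
    {
        // A MemoryStream over the finished bytes is seekable, which ZipFile needs.
        using (ZipFile zf = new ZipFile(new MemoryStream(zipBytes)))
        {
            ZipEntry entry = zf.GetEntry(entryName);
            if (entry == null) return null;

            using (Stream entryStream = zf.GetInputStream(entry))
            using (MemoryStream result = new MemoryStream())
            {
                entryStream.CopyTo(result);
                return result.ToArray();
            }
        }
    }

    static void Main()
    {
        byte[] zip = BuildZip("hello.txt", Encoding.UTF8.GetBytes("hello"));
        Console.WriteLine(Encoding.UTF8.GetString(ExtractEntry(zip, "hello.txt")));
    }
}
```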

Hi, I'm just about managing to get this to work, however I have a problem.

The system I'm working on runs on handheld devices and receives chunks of data from a web service (I'm using vb.net). I'm using #ZipLib to compress the data before it is sent, to reduce the air-time costs.

When I receive the data, I obviously have no idea how big the original data was, so I don't know how big to set bytesUncompressed byteArray to.

I don't know if this is the right place to ask my question, but...
I have a compressed string that I get from an outside server. It is compressed as a zip. I need to unzip it. I noticed that you said a dummy file must be used. Do you know how I can do that? How can I uncompress a zipped string?

The point was already made; I simply haven't had time to fix it. It worked fine every time I tried it, but my assumption that the data would be fully uncompressed before calling Read on the stream was wrong.

For those who want a fix, it's not hard. Prepend the compressed data with the length of the uncompressed data, as was recommended in the other post. On the receiving side, take the first 4 bytes off and use them as the uncompressed size.
Oddly enough, I never ran into it in the final project, as I encoded the uncompressed length elsewhere. My apologies to those who had trouble working it out.
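The length-prefix fix described above can be sketched with BitConverter alone. This is a minimal sketch: the helper names are hypothetical, and it assumes both ends use the same byte order, since BitConverter follows the byte order of the machine it runs on.

```csharp
using System;

static class LengthPrefixSketch
{
    // Returns [4-byte uncompressed length][compressed data].
    public static byte[] AddLengthPrefix(int uncompressedLength, byte[] compressed)
    {
        byte[] prefix = BitConverter.GetBytes(uncompressedLength); // always 4 bytes
        byte[] framed = new byte[4 + compressed.Length];
        Buffer.BlockCopy(prefix, 0, framed, 0, 4);
        Buffer.BlockCopy(compressed, 0, framed, 4, compressed.Length);
        return framed;
    }

    // Splits the frame back into the stored length and the compressed data.
    public static int ReadLengthPrefix(byte[] framed, out byte[] compressed)
    {
        int uncompressedLength = BitConverter.ToInt32(framed, 0);
        compressed = new byte[framed.Length - 4];
        Buffer.BlockCopy(framed, 4, compressed, 0, compressed.Length);
        return uncompressedLength;
    }

    static void Main()
    {
        byte[] framed = AddLengthPrefix(43, new byte[] { 1, 2, 3 });
        byte[] compressed;
        int length = ReadLengthPrefix(framed, out compressed);
        Console.WriteLine("{0} bytes expected after decompression", length);
    }
}
```

The receiver can then allocate a buffer of exactly that size and keep calling Read until it is full.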

If I get some free time I will update this post; otherwise, consider it abandoned. Someone else could tack on their solution in a comment.

I can only guess 1 of 2 things.
1) I assume you are using MemoryStream. If you are using a FileStream instead, the default security policy may not allow certain code domains access to the filesystem. This is a known "feature" of managed code, meant to prevent unauthorized access to files. The fix should be to digitally sign your code and add the certificates to the client machines so that they grant the program the access it requires. This only makes sense for final builds, since I believe the digital signature changes with each recompilation.
2) Since the code runs and only throws an exception, dig for the inner exception and see what is causing it. I'm sure you've stepped through the code, but try again. Have you tried running the code outside of an IE hosted control to make sure it works properly? I've heard a few people report that they couldn't get it working, but I'm fairly certain those are end-user errors, because it works fine for me and quite a few others. The only reason I can suggest for an IE hosted control throwing a security exception is an attempt to access the filesystem without the proper security policy. I can't help much more without detailed exception information and perhaps a chunk of the code that causes the issue.