GZIP BSON in MongoDB with PHP or Node.js

I recently needed to store a large number of files (300M, 3-4TB GZIP) distributed across a number of systems, while maintaining index information about each file in MongoDB. The need to access and modify large blocks of the files in rapid succession with minimal latency made S3 a less attractive option in this case. Storing and managing the files in directories across multiple systems, or setting up and managing a completely separate storage system, also seemed like a lot of work for what would eventually become archival data once the files had been processed.

The MongoDB document model offers tremendous flexibility for horizontal scaling and indexing. I have grown quite fond of its flexible schema for ETL operations and for processing distributed workloads. In many cases, the low-level client libraries and drivers give you performance benefits over web-service based storage systems that provide similar functionality.

Why not just store the files in MongoDB?

We already had a MongoDB cluster for ETL that I could scale out as needed. No extra storage system or code to manage. The files could be stored in a separate database which could later be dropped to reclaim disk space.

For all its flexibility, MongoDB does not support collection compression or any sort of pluggable storage engine (as MariaDB and MySQL do) that would save disk space. The forthcoming version 2.8 aims to change that with a full set of storage APIs, but in the meantime we can handle content compression at the application level and store files in a serialized binary (BSON) field.

The following are some crude examples (minus error handling) of storing GZIP content in MongoDB using BSON encoding.
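
As a minimal Node.js sketch of the idea, the snippet below compresses content with the built-in zlib module and places the resulting Buffer in a document field, which the MongoDB driver serializes as BSON binary. The field names (name, size, data) and the helper function names are hypothetical, chosen only for illustration. The collection argument is assumed to be a collection object from the official mongodb driver, using its promise-based insertOne and findOne calls.

```javascript
const zlib = require('zlib');

// Build a document holding the gzip-compressed content plus a few index fields.
// The field names here are illustrative assumptions, not a required schema.
function buildGzipDocument(name, content) {
  const raw = Buffer.isBuffer(content) ? content : Buffer.from(content, 'utf8');
  return {
    name: name,
    size: raw.length,         // uncompressed size in bytes
    data: zlib.gzipSync(raw)  // a Buffer, stored by the driver as BSON binary
  };
}

// Decompress the binary field of a document read back from MongoDB.
function readGzipDocument(doc) {
  // Depending on driver options, the field comes back either as a Buffer or
  // as a BSON Binary wrapper that exposes the raw bytes on `.buffer`.
  const bytes = Buffer.isBuffer(doc.data) ? doc.data : doc.data.buffer;
  return zlib.gunzipSync(bytes);
}

// Assumes `collection` is a MongoDB driver collection (promise-based API).
async function storeFile(collection, name, content) {
  return collection.insertOne(buildGzipDocument(name, content));
}

// Returns the decompressed content as a Buffer, or null if nothing matches.
async function loadFile(collection, name) {
  const doc = await collection.findOne({ name: name });
  return doc ? readGzipDocument(doc) : null;
}
```

With a connected client, usage would look something like await storeFile(db.collection('files'), 'report.txt', text), followed later by (await loadFile(db.collection('files'), 'report.txt')).toString('utf8'). Keep in mind that a single BSON document is limited to 16MB, so this approach only suits files whose compressed size stays under that limit.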

Accessing GZIP compressed files stored in MongoDB with either PHP or Node.js is pretty straightforward. If you have a similar use case where storing file content in MongoDB makes sense, compression can save a considerable amount of disk space and IO operations. This also gives you a similar access pattern for both your file contents and the indexes for that content. It’s not perfect for every scenario, but the ability to store GZIP binary content in MongoDB is incredibly useful.