Working with huMONGOus data

Imagine tracking and analyzing data for a million hits a day on average. The storage and processing load compounds when you also have to consider trends over the past week, or even the past month. In this post, I will highlight the factors that led us to choose MongoDB as our data store.
For analytics in Unbxd Search, we track various metrics from the browser. From the collected data, we infer popular products, top sales, etc. This gives us insight into customer behavior on an e-commerce site. The whole process is carried out in 3 steps, namely:

Data collection

Data aggregation

Data analysis

The analyzed data is then used to generate reports such as the one given below, and also to tweak search based on customer sales trends.
Analytics in Unbxd Commerce Search dashboard
Let us visit the reasons that led us to adopt MongoDB:

Faster reads
MongoDB supports fast reads, which is crucial for an analytics platform. The read-to-write ratio for the Unbxd analytics platform is high: we have few writes, during data collection and storage of aggregated results, compared to the many retrievals of results for reports. It is a typical write-once, read-many scenario.

Although writes are slower, at our end we use load-balanced message queues to handle bursts of data coming to MongoDB.
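To give a feel for how a queue can absorb a burst of writes, here is a minimal sketch in Python. It is not our production code: the function name, the batch size, and the use of an in-process queue are all assumptions made for illustration. The only MongoDB-specific call is `insert_many`, which PyMongo collections provide.

```python
import queue


def drain_in_batches(events, collection, batch_size=500):
    """Move queued events into the collection in batches.

    `events` is a queue.Queue of dict documents and `collection` is any
    object with an insert_many(list_of_docs) method, such as a PyMongo
    collection. Returns the number of batches written.
    (Hypothetical helper, for illustration only.)
    """
    batches = 0
    batch = []
    while True:
        try:
            batch.append(events.get_nowait())
        except queue.Empty:
            break  # the queue is drained
        if len(batch) >= batch_size:
            collection.insert_many(batch)  # one round trip per full batch
            batches += 1
            batch = []
    if batch:
        # Flush whatever is left over as a final, smaller batch.
        collection.insert_many(batch)
        batches += 1
    return batches
```

Batching this way turns many small writes into a few larger ones, which is one way to keep a sudden spike from hitting the database directly.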

In-built Aggregation framework
Apart from the slower map-reduce method, MongoDB provides group-by sub-routines, which are very useful considering the operations they support. The sub-routines provided in the aggregation framework make life easier for developers. Aggregation tasks such as group-by, total count, average, sorting, etc. are handled by a single function call. The majority of data grouping and processing is covered by these routines; code needs to be written only for processing over and above this.
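As a rough illustration, the sketch below builds an aggregation pipeline that counts clicks per product, sorts by count, and keeps the top few. The `$group`, `$sum`, `$sort` and `$limit` operators are part of MongoDB's aggregation framework; the function name, the `clicks` collection and the `productId` field are assumptions for this example, not our actual schema.

```python
def top_clicked_products_pipeline(limit=10):
    """Build a pipeline that ranks products by number of click events.

    (Hypothetical helper; the field name productId is assumed.)
    """
    return [
        # Group click events by product and count one per event.
        {"$group": {"_id": "$productId", "clicks": {"$sum": 1}}},
        # Most-clicked products first.
        {"$sort": {"clicks": -1}},
        # Keep only the top `limit` products.
        {"$limit": limit},
    ]


pipeline = top_clicked_products_pipeline(limit=5)

# With PyMongo and a running server, the whole pipeline goes out in one call:
#   results = list(db.clicks.aggregate(pipeline))
```

The grouping, counting and sorting all happen inside the database, so the application only receives the final handful of rows.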

Schema-less & Easier to modify any table or collection

Each collection (the equivalent of a table in MySQL) in MongoDB stores data as documents made up of key-value pairs. Now consider a scenario where you are tracking products clicked by users of an e-commerce site, and currently you log only productId and the search query. Going forward, we may need to log another attribute, say productName, with each click. With MongoDB in the picture, we can simply add the extra attribute, and MongoDB starts storing it in the collection alongside the existing documents. There is no need to make changes in our data collection code, nor do we need to fire ALTER TABLE commands as in relational databases.
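The click-tracking scenario above can be sketched in a few lines of Python. The helper name and the sample values are made up for illustration; only productId, the search query and productName come from the scenario itself.

```python
def make_click_event(product_id, query, **extra):
    """Build a click document; any extra attributes are stored as given.

    (Hypothetical helper, for illustration only.)
    """
    event = {"productId": product_id, "query": query}
    event.update(extra)  # new attributes need no schema change
    return event


# An event logged before productName was introduced (sample values).
old_event = make_click_event("p-101", "red shoes")

# An event logged afterwards, carrying the extra attribute.
new_event = make_click_event("p-102", "blue shirt", productName="Oxford Shirt")

# Both documents can live in the same collection, for example with PyMongo:
#   db.clicks.insert_one(old_event)
#   db.clicks.insert_one(new_event)
```

Older documents simply lack the new key, so any code that reads productName should be prepared for it to be missing.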

As seen above, with just 3 statements we are ready to perform operations on the table 'table1'. Several features, such as dynamic database creation and the upsert operation (a single call updates existing data, otherwise inserts new data), make it quicker and easier to code your DB operations.
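To make the upsert idea concrete, here is a small sketch that assumes PyMongo's `update_one` with `upsert=True`. The function name and the `productId`, `day` and `clicks` fields are hypothetical and chosen only for this example.

```python
def record_click_count(collection, product_id, day, clicks):
    """Store the click count for a product on a given day.

    If a document for (product_id, day) already exists it is updated;
    otherwise a new one is inserted. `collection` is expected to behave
    like a PyMongo collection. (Hypothetical helper and field names.)
    """
    return collection.update_one(
        {"productId": product_id, "day": day},  # which document to match
        {"$set": {"clicks": clicks}},           # what to write into it
        upsert=True,                            # insert if nothing matched
    )


# With PyMongo and a running server, usage would look roughly like:
#   record_click_count(db.daily_counts, "p-101", "2013-01-15", 42)
```

Without upsert, the same logic would need a lookup followed by either an insert or an update, which means more code and an extra round trip.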

JSON representation

MongoDB stores data as JSON-like documents (BSON, a binary form of JSON, internally). This favors representation of relations and hierarchy in the stored data. We may have collected data or aggregated results that need to be stored as nested objects, arrays, lists, etc. MongoDB supports such storage in a single document (the equivalent of a row in MySQL).
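For example, an aggregated result with nested objects and arrays might look like the document below. Every field name and value here is invented purely for illustration. Python's standard json module is used only to show that the structure survives a round trip as a single document.

```python
import json

# A hypothetical aggregated result; all names and numbers are sample values.
daily_report = {
    "day": "2013-01-15",
    "topProducts": [
        {"productId": "p-101", "clicks": 42, "queries": ["red shoes", "shoes"]},
        {"productId": "p-102", "clicks": 17, "queries": ["blue shirt"]},
    ],
    "totals": {"clicks": 59, "searches": 120},
}

# Serialize to JSON text and back; the nesting is preserved as-is.
encoded = json.dumps(daily_report)
decoded = json.loads(encoded)
```

In a relational schema, the same report would typically be split across several tables and joined back together at read time.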

Hadoop Adapter

In the future, you may have huge volumes of data and may want to move data processing elsewhere. In such situations, you can keep MongoDB as your data store, either for input or for results, and push the processing of data to Hadoop using MongoDB's Hadoop adapter.

In a subsequent post, I will go into the details of the Unbxd analytics platform. For now, let’s keep the discussion alive in the comments section below or email me at: udit AT unbxd DOT com.