Storage and beyond


Facebook’s scale is off the radar for the vast majority of use cases, yet it is very interesting to look for new ideas there.

Audience insight is a way to show us the pages we are most likely to like. It is based on previously liked pages, gender, location and many other features.

In particular, at Facebook’s scale it is about 35 TB of raw data, and a query like ‘give me several pages which should be shown to user X’ must complete within hundreds of milliseconds. That requires processing hundreds of billions of likes and billions of pages, numbers beyond reasonable, in a fraction of a second.

It turns out that 168 nodes can handle it with some magic like columnar storage, in-memory data, bitmap indexes, GPUs and caches. Very interesting reading!
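To give a feeling for why bitmap indexes over columnar data are so fast, here is a minimal sketch in Python. It is not Facebook’s implementation; the column contents and the function names `build_bitmap_index` and `matching_rows` are made up for illustration, and plain Python integers stand in for real compressed bitmaps.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i is set when column[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return index


def matching_rows(bitmap):
    """Expand a bitmap back into a list of row numbers."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical columnar layout: one list per attribute, row i describes user i.
gender = ["f", "m", "f", "f", "m"]
city = ["Moscow", "Moscow", "Berlin", "Moscow", "Berlin"]

gender_index = build_bitmap_index(gender)
city_index = build_bitmap_index(city)

# A filter on two attributes becomes a single bitwise AND over two bitmaps.
women_in_moscow = matching_rows(gender_index["f"] & city_index["Moscow"])
```

Each attribute lives in its own column, so a filter touches only the columns it needs, and combining filters costs one bitwise operation per machine word rather than one comparison per row.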

This closes races between data defragmentation and blob file updates, but the main goal was to eliminate sorted indexes. Well, now we only have a single sorted index as well as sorted data, so iteration and data lookup are very fast.
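Why a single sorted index makes both lookup and iteration cheap can be shown with a small Python sketch. This is not eblob’s on-disk format: the `(key, offset, size)` entry layout and the helper names `build_index`, `lookup` and `scan` are assumptions made for illustration only.

```python
import bisect


def build_index(entries):
    """Sort (key, offset, size) entries by key and return (keys, entries)."""
    ordered = sorted(entries, key=lambda e: e[0])
    keys = [e[0] for e in ordered]
    return keys, ordered


def lookup(keys, ordered, key):
    """Binary search: O(log n) comparisons instead of a linear scan."""
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        return ordered[pos]
    return None


def scan(keys, ordered, start, end):
    """Return entries with start <= key < end, already in key order."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return ordered[lo:hi]


# Hypothetical index entries: key, offset in the blob file, record size.
entries = [(b"k3", 200, 10), (b"k1", 0, 100), (b"k2", 100, 100)]
keys, ordered = build_index(entries)
```

With data stored in the same order as the index, a range scan like `scan(keys, ordered, b"k1", b"k3")` maps to one sequential read of the blob instead of many random seeks.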

This is the first step in preparation for unified iterators in Elliptics storage, whose ultimate goal is to get rid of metadata and to move the recovery process out of the elliptics core into a separate module/process. This will allow us to create different recovery policies: from an rsync-like copy of the whole local storage to cherry-pick-like recovery of particular keys.

Sorted keys in eblob also allow much faster scans and column-like storage creation. We are about to remove column support from eblob and elliptics, not because we do not want to have it, but because a column is actually a very different abstraction level than low-level storage. Columns just don’t belong to that ground level.

Instead, columns can be trivially implemented on top of the existing key mapping in elliptics. We will likely create a special API for this, but it can already be done easily by hand by setting up keys properly.
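As a rough illustration of what "setting up keys properly" could look like, here is a toy Python sketch. It does not use the elliptics API: a plain dict stands in for the storage, and the `ColumnStore` class, the `make_key` naming scheme (a hash of the row name, a zero byte and the column name) and the sample data are all hypothetical.

```python
import hashlib


class ColumnStore:
    """Toy column layer over a flat key-value mapping (a dict stands in for the storage)."""

    def __init__(self):
        self.kv = {}

    @staticmethod
    def make_key(row, column):
        # Hypothetical naming scheme: hash of "<row>\0<column>".
        name = row.encode() + b"\0" + column.encode()
        return hashlib.sha512(name).digest()

    def write(self, row, column, value):
        # Each (row, column) pair is just another key in the flat storage.
        self.kv[self.make_key(row, column)] = value

    def read(self, row, column):
        # Returns None when this column was never written for the row.
        return self.kv.get(self.make_key(row, column))


store = ColumnStore()
store.write("user:42", "name", b"Alice")
store.write("user:42", "email", b"alice@example.com")
```

The low-level storage sees only ordinary keys and values, so the column abstraction lives entirely in the client-side naming convention, which is the point made above.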