The Hadoop (HDFS) balancer moves blocks from one datanode to another so that each datanode holds roughly the same amount of data (within a configurable threshold). This breaks HBase's data locality, meaning that a particular region may be serving a file that is no longer on its local host.
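For reference, the HDFS balancer is typically invoked like this (10 is the default threshold; treat the value as a placeholder for your own setting):

```shell
# Rebalance HDFS until every datanode's utilization is within 10
# percentage points of the cluster-wide average utilization.
hdfs balancer -threshold 10
```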

HBase's balance_switch balances the cluster so that each regionserver hosts roughly the same number of regions. This is separate from Hadoop's (HDFS) balancer.
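In the HBase shell, that looks roughly like this (a sketch; balance_switch returns the previous setting):

```shell
# Inside the HBase shell (started with `hbase shell`):
balance_switch true   # enable the region balancer
balancer              # trigger a balancing run now
```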

If you are running only HBase, I recommend not running Hadoop's (HDFS) balancer, as it will cause certain regions to lose their data locality. Every request to such a region then has to go over the network to one of the datanodes serving its HFile.

HBase's data locality is recovered over time, though. Whenever a major compaction occurs, all of a region's blocks are rewritten locally on the regionserver serving that region and merged, and at that point data locality is restored for that region. So all you really need to do to add new nodes to the cluster is add them: HBase will take care of rebalancing the regions, and once those regions compact, data locality will be restored.
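If you want locality back sooner than the next scheduled compaction, you can trigger a major compaction yourself; a sketch, where 'my_table' is a placeholder table name:

```shell
# Inside the HBase shell: force a major compaction of one table,
# which rewrites its HFiles locally on each hosting regionserver.
major_compact 'my_table'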

For some reason, the Spark build that ships with CDH doesn't work with PredictionIO 0.9.3, so I use the Spark 1.3.1 binary with Hadoop 2.6 support instead, extracted to SPARK_HOME=$PIO_HOME/vendors/spark-1.3.1-bin-hadoop2.6
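For reference, fetching and extracting that build might look like this (the URL follows the Apache archive's release layout; verify it before relying on it):

```shell
mkdir -p $PIO_HOME/vendors
wget https://archive.apache.org/dist/spark/spark-1.3.1/spark-1.3.1-bin-hadoop2.6.tgz
tar -xzf spark-1.3.1-bin-hadoop2.6.tgz -C $PIO_HOME/vendors
```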

From CDH I only use the HBase part, as the event server storage.

I use Elasticsearch as metadata storage.

I use LocalFS as model storage.
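Putting those three storage choices together, the relevant part of conf/pio-env.sh looks roughly like this (repository names, hosts, and ports are placeholders for my setup):

```shell
SPARK_HOME=$PIO_HOME/vendors/spark-1.3.1-bin-hadoop2.6

PIO_STORAGE_REPOSITORIES_METADATA_NAME=pio_meta
PIO_STORAGE_REPOSITORIES_METADATA_SOURCE=ELASTICSEARCH
PIO_STORAGE_REPOSITORIES_EVENTDATA_NAME=pio_event
PIO_STORAGE_REPOSITORIES_EVENTDATA_SOURCE=HBASE
PIO_STORAGE_REPOSITORIES_MODELDATA_NAME=pio_model
PIO_STORAGE_REPOSITORIES_MODELDATA_SOURCE=LOCALFS

PIO_STORAGE_SOURCES_ELASTICSEARCH_TYPE=elasticsearch
PIO_STORAGE_SOURCES_ELASTICSEARCH_HOSTS=localhost
PIO_STORAGE_SOURCES_ELASTICSEARCH_PORTS=9300
PIO_STORAGE_SOURCES_HBASE_TYPE=hbase
PIO_STORAGE_SOURCES_LOCALFS_TYPE=localfs
PIO_STORAGE_SOURCES_LOCALFS_PATH=$PIO_FS_BASEDIR/models
```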

I installed the Spark standalone server manually (not from CDH): Spark 1.3.1 with Hadoop 2.6 support.
– For this test case I'm using a Spark master with four worker nodes; let's say it is installed at spark://my.remote.sparkhost:7077
– If you don't know how to install a standalone Spark server, please read the Spark manual.
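Bringing that cluster up can be sketched as follows (master URL is the example above; I believe Spark 1.3.x's start-slave.sh takes a worker number before the master URL, while later versions drop it):

```shell
# On the master host:
$SPARK_HOME/sbin/start-master.sh

# On each of the four worker hosts:
$SPARK_HOME/sbin/start-slave.sh 1 spark://my.remote.sparkhost:7077
```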

I know you can also use awk or some other shell commands, but Perl regexes are very POWERFUL and FAST.
I picked up this Perl regex tip some time ago from a Stack Overflow answer (I'll add the link once I find it again), and this method worked for me to convert standard tab-separated output into CSV-compatible output 😉
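I no longer have the exact one-liner from that answer, so here is a sketch of the same idea: split each line on tabs, escape embedded double quotes, and wrap every field in quotes (file names are placeholders):

```shell
# Create a small TSV sample (input.tsv / output.csv are placeholder names).
printf 'a\tb "x"\tc\n' > input.tsv

# Split on tabs (keeping trailing empty fields), double any embedded
# quotes per the CSV convention, then quote and comma-join the fields.
perl -ne 'chomp; @f = split /\t/, $_, -1; print join(",", map { s/"/""/g; qq{"$_"} } @f), "\n";' input.tsv > output.csv

cat output.csv   # "a","b ""x""","c"
```

Quoting every field is slightly verbose but keeps fields containing commas, quotes, or spaces intact; a bare s/\t/,/g would break on such values.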