Playing with HDP sandbox is cool. This is a quick way to see how things are going, try some Hive requests, and so on.

At one time, you may want to build your own virtual cluster on your machine : several virtual machines, making a “real” cluster to have something closest to reality.

Of course you can start from scratch, but there’s a quickest and easiest way : playing with structor.

Structor is a project hosted by Hortonworks, based on Vagrant (basically a tool who plays with VMs on VirtualBox or VMWare, dealing with OS images) and there’s a lot of forks for its functionalities to grow.
I based my stuff on Andrew Miller’s fork, with some minor modifications.

This will create a structor directory in which you’ll find basically a Vagrantfile which contains everything to build your cluster !You just have to specify aa HDP version, an Ambari version and a profile name to build the corresponding cluster.

For building my 3-nodes cluster, I just type

structor~ $ ./ambari-cluster hdp-2.1.3 ambari-1.6.1 3node-full

In the ./profiles directory you’ll find some predefined profiles : 1node-min, 1node-full, 3node-min and 3node-full, feel free to create and submit some other profiles, but remember that your VMs will need some memory to be usable ! I put 2GB for each VM, so a 3-nodes cluster will take 6GB (7GB with a proxy VM)

I added some slight modifications for VirtualBox VMs to be named as their configuration so you can have multiple ambari-clusters in VirtualBox.

Things you have to know :

1. add the hosts on /etc/hosts on your Linux or Mac, or the equivalent on Windows (look at the doc on my github)

2. the private key is stored in each VM

3. to ssh in a machine, cd to your structor directory and do a vagrant ssh HOSTNAME

We have previously defined that 5 regions would be accurate, regarding region servers number and desired regions size, and 2 basic algorithms are supplied, HexStringSplit and UniformSplit (but you can add yours).

For the record, you can create a HBase table pre-splitted if you know the repartition of your keys : either by passing SPLITS, or by providing a SPLITS_FILE which contains the points of splitting (so lines number =regions -1)
Be aware of the order, SPLITS_FILE before {…} won’t work.

Capacity scheduler allows you to create some queues at different levels, with allocating different ratio of usage.

At the beginning, you have only one queue, which is root.
All of the following is defined in conf/capacity-scheduler.xml (/etc/hadoop/conf.empty/capacity-scheduler.xml in my HDP 2.1.3) or in YARN Configs/Scheduler in Ambari.

Let’s start with two queues : a “production” and a “development” queues, which are all root subqueues

Queues definition

yarn.scheduler.capacity.root.queues=prod,dev

Now, maybe we have 2 teams in the dev department : Enginers and Datascientists.
Let split the dev queue in two sub-queues :

yarn.scheduler.capacity.root.dev.queues=eng,ds

Queues capacity

We now have to define the percentage capacity of all these queues; Note that the total must be 100 (the root capacity), if not you won’t be able to start that scheduler.

So prod will have 70% of the cluster resources and dev will have 30%.
Not really, infact ! If a job is run in dev and there’s no use of prod, then dev will take 100% of the cluster.
This make sense, because we don’t want the cluster to be under-utilized.

As you can imagine, eng will take 60% of dev capacity, and is able to reach 100% of dev if ds is empty.

We may want to limit dev to a maximum extended capacity (default is so 100%) because we never want this queue to use too much resources.
For that purpose, use the max-capacity parameter

yarn.scheduler.capacity.root.dev.max-capacity=60

Queues status

Must be set to RUNNING; If set to STOPPED then you won’t be able to submit new jobs to that queue.

Queues ACLs

The most important thing to understand is that ACLs are inherited. That means that you can’t restrain permissions, only extends them !
Most common mistake is ACLs set to * (meaning all users) on the root level : consequently, any user will be able to submit jobs to any queue : default is

yarn.scheduler.capacity.root.default.acl_submit_applications=*

ACLs format is a bit tricky :

<username>,<username>,[...]
<username>SPACE<group>,[...]

Then, on each queue, you can set 3 parameters : acl_submit_applications, acl_administer_queue and acl_administer_jobs.

When you want to test some “real” stuff, it can be useful to have a Hive table with some “big” data in it.

Let’s build a table based on Wikipedia traffic stats. It will contains page counts for every page since 2008, allowing us to have partitions too.

These stats are available here so let’s grab those files first.
This is a lot of files, here we’re getting only 2008/01 files, which are each between 20 and 30MB.

I installed axel which is a download accelerator, available on EPEL repo or with the rpm

[vagrant@gw ~]$ sudo rpm -ivh http://pkgs.repoforge.org/axel/axel-2.4-1.el6.rf.x86_64.rpm
[vagrant@gw pagecounts]$ for YEAR in {2008..2008}; do for MONTH in {01..01}; do for DAY in {01..08}; do for HOUR in {00..23}; do axel -a http://dumps.wikimedia.org/other/pagecounts-raw/${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz; done; done; done; done