Steps taken to produce a 4 node Hadoop 2 cluster on Raspberry Pis.

Purpose of project

The whole reason I’m going through the motions on this basic “Hello World” type of exercise is that I believe distributed computing is going to become more and more of a distinguishing mark of any data engineer’s trade. I already felt it in my last job search – the lack of Hadoop / Spark / distributed computing experience. My journey began in simple single-node RDBMS environments (MySQL), and then a shared-nothing MPP solution (PostgreSQL-flavored) – that last step started to open my eyes to the benefits of distributed databases. The next step for me, I think, is to dip my toe into the Hadoop / distributed computing world. Even if I’m wrong, and I can have a solid career without the Hadoop experience, I still think it’s a good challenge and good exposure to current trends in data engineering technologies – even if I’m more than a few years behind the times.

Set up each Raspberry Pi unit

I’m going to be setting up my NameNode first and then I’ll ssh into all the other boxes from there. As such, I’ll need to set up each unit with SSH enabled.
The unit boots to the desktop OS initially, so I open a terminal session and run raspi-config (a scripted equivalent is sketched after this list):
* Expand Filesystem
* Boot options -> Desktop/CLI -> Console Autologin
* Advanced Options -> SSH
* Advanced Options -> Hostname -> “pi-hdp01”
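
For repeatability on the other three units, the same settings can be applied non-interactively. This is only a rough sketch, assuming a raspi-config recent enough to expose the nonint interface; the function names below come from the current script and may differ on older images:

sudo raspi-config nonint do_expand_rootfs
sudo raspi-config nonint do_boot_behaviour B2   # console autologin
sudo raspi-config nonint do_ssh 0               # 0 = enable SSH
sudo raspi-config nonint do_hostname pi-hdp01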

At this point I also identified the IP address of this particular unit.
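
A quick way to check that from the terminal (assuming the Pi is on Ethernet as eth0):

hostname -I            # prints the assigned address(es)
ip addr show eth0      # more detail, including interface state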

Install protobuf 2.5.0

wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar xzvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
./configure --prefix=/usr
make # I don’t typically run these make commands, so this seemed like it took forever (10 minutes?) and I kept getting the same weird messages over and over
make check
sudo make install
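
If you do build it, a quick sanity check that the install landed where expected:

protoc --version   # should report libprotoc 2.5.0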

I think the above was unnecessary, since I ended up downloading the Hadoop binaries instead of building from source – protobuf is only needed when compiling Hadoop itself.

Install Hadoop

At this point, I broke away from the tutorial I was following, which tried to build Hadoop from source with make and all that craziness. I’m just downloading the binary release and copying it to /opt, as per another tutorial.
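
Roughly what that looks like. The release number and mirror URL here are my own assumptions (2.7.3 from the Apache archive); substitute whatever version the tutorial uses:

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
sudo tar xzvf hadoop-2.7.3.tar.gz -C /opt
sudo chown -R pi:pi /opt/hadoop-2.7.3
# point the shell at the install, e.g. in ~/.bashrc
export HADOOP_HOME=/opt/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin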

Now I’m going to try running the same MapReduce job they’ve provided in the examples, counting the occurrences of that “dfs” string:
First, bring the distributed file system (HDFS) back up: sbin/start-dfs.sh
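
For reference, this is the grep job from the Hadoop single-node setup guide. The jar’s version number is an assumption (match it to your install), and I’m assuming the input directory was already loaded into HDFS during the earlier single-node run:

bin/hdfs dfs -put etc/hadoop input        # skip if input is already loaded
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
bin/hdfs dfs -cat output/*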
I’ve been following this blog on and off, and found this “jps” command super helpful:
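jps just lists the running JVM processes, which makes it an easy way to confirm the daemons actually came up on each node:

jps
# on the master, look for NameNode (plus ResourceManager if YARN is running);
# worker nodes should show DataNode (and NodeManager)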

There were a lot of “failures” reported, and the job took a lot longer than it did in single-node and pseudo-distributed modes – I’m not sure why, but this is also a pretty simple example and not a typical use case, I’m sure.