Cloudera Hadoop components are distributed as binary packages called parcels. Parcels differ from standard packages in the following advantages:

Simple installation: every parcel is a single file that contains all of its components.

Internal consistency: all components in a parcel are carefully tested, ordered and matched to each other, so incompatibilities between components are rare.

Separate distribution and activation: parcels can first be distributed to all managed nodes and then activated with a single action, so the system is updated quickly and with very little work.

Rolling updates: after a minor version update, all new processes run with the new version, while old processes continue with the old version until they finish. To update a major version you have to stop the cluster and all its work first, and only then update.

Easy rollback: if you run into a problem with a new version of CDH, you can return to the previous version.

A cluster with 7 nodes in Cloudera looks like the following screenshot:

The installation and configuration of Cloudera
The installation is possible on Linux. I installed on CentOS 6.6 x64.
RAID1 is recommended for the Cloudera Manager server (I used the default disk layout). The server resources are as follows.
Cloudera Manager: HDD 150GB, RAM 8GB, 1 CPU with 4 cores
Hostname: chm.unixmen.com

On the Hadoop nodes the disk has to be split into two parts: a 100GB / partition for Linux and the installed software, and a separate /dfs partition for HDFS. Swap is 8GB.
Hadoop node resources: HDD 150GB (/ 100GB, /dfs 50GB), RAM 8GB, 1 CPU with 4 cores
Hostnames: hdn1.unixmen.com, hdn2.unixmen.com, hdn3.unixmen.com

And now let's start our cluster
After the steps above there is nothing left to do in the CLI. Open any web browser and go to http://chm.unixmen.com:7180 to open the Cloudera web installer. The default login and password are admin and admin.

After logging in, click admin -> Change Password in the top right corner and change the password so that nobody else can log in with the default one.

Then we log out, log in again with the new password, choose Cloudera Standard and click Continue (as in the screenshot):

On the next page click Continue again:

We add the nodes by name or IP address, separated by commas as in the screenshot, and click Search. To search a whole network we can write a range such as 10.10.10.[2-254]. This pattern covers only the addresses of that network (the nodes are shown in the screenshot):
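
As a rough illustration of what a bracketed range such as 10.10.10.[2-254] stands for, the following Python sketch expands one range into individual addresses. The function name `expand_host_pattern` is hypothetical, and this is not Cloudera Manager's own parser: it assumes a single `[start-end]` group per pattern.

```python
import re

def expand_host_pattern(pattern):
    """Expand one [start-end] group, e.g. 10.10.10.[2-4] gives three addresses.

    Hypothetical helper for illustration only; the real search field of
    Cloudera Manager may accept more forms than this sketch handles.
    """
    match = re.search(r"\[(\d+)-(\d+)\]", pattern)
    if match is None:
        # No range group: the pattern is a single host name or address.
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [f"{prefix}{n}{suffix}" for n in range(start, end + 1)]

if __name__ == "__main__":
    hosts = expand_host_pattern("10.10.10.[2-254]")
    # Print how many addresses the range covers, plus the first and last one.
    print(len(hosts), hosts[0], hosts[-1])
```

A range like this is convenient when the nodes have consecutive addresses; a comma-separated list of names works just as well for a handful of hosts.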

On the page that opens we select all the nodes on the left and click Continue (as shown in the screenshot):

On the next page we choose the options as shown and click Continue:

Then we copy the private key (id_rsa) of the SSH key pair that we generated earlier on the CHM server to our Windows desktop, and click Choose File to upload it as shown in the screenshot. Cloudera Manager uses this key file to log in to the nodes, which already hold the matching public key. Then click Continue:

If the installation proceeds as follows, everything is OK:

After the installation finishes on all nodes, a successful result looks like this (click Continue to go on):

Then the parcels we have chosen are installed, and a successful result looks like this (click Continue):

The nodes are then tested, and a successful result looks like this (click Continue to go on):

For installation we choose all services and click Continue:

On the page that opens we keep the default database, click Test Connection to test the connection, and then click Continue:

On the next page we leave everything at the defaults and click Continue.
The cluster services are started. A successful result looks like this:

The congratulations page is shown and our work ends successfully:

In the end our cluster and its services should look like this:

MapReduce is a distributed computing model created by Google. It is used for parallel processing of very large data sets (up to several petabytes) on a computer cluster.
MapReduce is also a framework for running a set of distributed tasks on many computers. It consists of two steps, Map and Reduce. In the Map step the incoming data is pre-processed: one of the machines (the master node, JobTracker) takes the incoming task, divides it into parts and sends them to other machines (the worker nodes, TaskTrackers) for initial processing. In the Reduce step the pre-processed data is aggregated: the master node collects the results from the worker nodes and forms the final result. The advantage of MapReduce is that tasks run in parallel, so the result is available quickly. MapReduce is closely integrated with HDFS and runs its tasks on the HDFS nodes where the data is stored.
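
The two steps can be sketched on a single machine with the classic word count. The Python below is only an illustration of the model, not Hadoop code: the function names are made up for the example, and the list of chunks stands in for the input splits that a real cluster would hand to different worker nodes.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the emitted values by key, as happens between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values of each key into a final count."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different worker node.
    pairs = []
    for chunk in chunks:
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))

if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Because every chunk is mapped independently, the Map step can be spread over as many machines as there are chunks, which is where the speed-up comes from.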

HDFS is a distributed file system that is spread across many nodes. Its main features are below:
* HDFS stores files in blocks, and the block size is typically 64MB or larger. (In most file systems the block size is about 4-32KB.)
* HDFS is very fast at reading large files, but it is not effective with small files.
* HDFS is optimized for data that is written once and read many times.
* Instead of protecting against disk faults on a single machine, HDFS replicates data to other disks. Each block of a file is stored on several cluster nodes, and the HDFS NameNode constantly monitors the reports sent by the DataNodes, so that every block keeps the required replication factor when there is a fault. Each storage node runs a process called DataNode, which manages the blocks on that machine, and these DataNodes are controlled by the NameNode (master server) on another machine.
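
The effect of block size and replication can be checked with some simple arithmetic. The Python sketch below uses hypothetical helper names, with a 64MB block size and a replication factor of 3 chosen purely as example values.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed for one file (the last block may be partial)."""
    # Integer ceiling division; an empty file needs no block.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

def raw_storage(file_size_bytes, replication):
    """Bytes stored across the cluster when every block is kept `replication` times."""
    return file_size_bytes * replication

if __name__ == "__main__":
    one_gb = 1024 * MB
    # One large file: few blocks for the NameNode to track.
    print(block_count(one_gb, 64 * MB), "blocks for a 1GB file")
    # The same amount of data with three replicas of every block.
    print(raw_storage(one_gb, 3) // MB, "MB of raw disk space")
    # Many small files: every file needs at least one block of its own.
    print(sum(block_count(1024, 64 * MB) for _ in range(1000)), "blocks for 1000 files of 1KB")
```

The last line hints at why small files are a poor fit: a thousand tiny files produce far more blocks to track than a single 1GB file, although they hold much less data.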

HDFS and MapReduce are both cluster software and share some general characteristics:
* Each of them has a master (coordinator) and worker architecture in the cluster.
* In both, the master (NameNode for HDFS and JobTracker for MapReduce) monitors the health of the cluster and handles processing faults.
* In both, the process running on each worker server (DataNode for HDFS and TaskTracker for MapReduce) is responsible for doing the work on its physical host; it takes tasks from its master (NameNode for HDFS and JobTracker for MapReduce) and reports back its health and progress.
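
The health reporting in the last point can be pictured as workers sending periodic reports to their master, which treats a worker as failed when its reports stop. The Python below is a simplified sketch; the function name and the timeout value are assumptions made for the example, not Hadoop settings.

```python
def stale_workers(last_report, now, timeout_s=30.0):
    """Return the workers whose latest report is older than timeout_s seconds.

    last_report maps a worker name to the time (in seconds) of its last
    report. The name and the default timeout are illustrative only.
    """
    return sorted(
        worker for worker, seen in last_report.items()
        if now - seen > timeout_s
    )

if __name__ == "__main__":
    reports = {"hdn1": 100.0, "hdn2": 95.0, "hdn3": 40.0}
    # hdn3 has been silent for 70 seconds, so the master would flag it.
    print(stale_workers(reports, now=110.0))
```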

Hadoop can run these components in three modes:
* Local standalone mode: this is the default mode, the one we used in the PI example. In this mode all Hadoop components (NameNode, DataNode, JobTracker and TaskTracker) run in a single Java process.
* Pseudo-distributed mode: in this mode each Hadoop component runs in its own JVM process and they communicate with each other through network sockets. It can be called a fully functional mini-cluster on a single host (in this mode we have to change the core-site.xml, hdfs-site.xml and mapred-site.xml files).
* Fully distributed mode: in this mode Hadoop runs on several machines, which are divided into hosts for the controlling processes (such as NameNode and JobTracker) and hosts for the worker processes.

YARN
MapReduce was almost completely reworked in Hadoop-0.23, and the result was published as MapReduce 2.0 (MRv2), also known as YARN. The purpose of MRv2 is to split the two main functions of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have one global ResourceManager (RM) and one ApplicationMaster (AM) per application, where an application is either a single MapReduce job or a DAG of jobs. The ResourceManager and the NodeManager, which runs on every slave node, together organize the processing of data. The ResourceManager is the final authority that grants resources to the applications in the system. The per-application ApplicationMaster is in effect a framework-specific library, which negotiates resources from the ResourceManager and works with the NodeManagers to execute and monitor the tasks.

The ResourceManager consists of two main components: the Scheduler and the ApplicationsManager.

The Scheduler is responsible for allocating resources to the various running applications, subject to the usual constraints such as capacities and queues. It only plans jobs and does not monitor or track the status of any application. It also gives no guarantee of restarting tasks that fail. It plans the resource needs of applications in terms of an abstract resource Container, which covers disk, CPU, RAM and network. The first version supported only RAM.
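
To make the idea of a resource Container more concrete, here is a toy Python sketch that grants memory-only containers on a first-fit basis. It is not the algorithm YARN uses; the `Node` class, the `allocate` function and the sizes are all invented for the illustration.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_mb: int

def allocate(nodes, request_mb, count):
    """Grant up to `count` containers of `request_mb` each, first fit by node.

    Returns the node name chosen for every granted container. Like the
    Scheduler described above, it only plans: nothing is monitored or
    restarted here.
    """
    granted = []
    for node in nodes:
        while len(granted) < count and node.free_mb >= request_mb:
            node.free_mb -= request_mb
            granted.append(node.name)
    return granted

if __name__ == "__main__":
    cluster = [Node("hdn1", 4096), Node("hdn2", 2048)]
    # Ask for four containers of 1536MB; only three fit in this cluster.
    print(allocate(cluster, request_mb=1536, count=4))
```

When fewer containers are granted than requested, the application has to wait or ask again later, which is the kind of negotiation the ApplicationMaster carries out.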

The Scheduler has a pluggable policy, which is responsible for partitioning the cluster resources among the different queues and applications. The CapacityScheduler and the FairScheduler of MapReduce are examples of such plugins.

The CapacityScheduler supports hierarchical queues, which allow a more predictable sharing of the cluster resources.

The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for running the application-specific ApplicationMaster, and restarting the ApplicationMaster container if it fails.
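
For the hierarchical queues of the CapacityScheduler, each queue is given a percentage of its parent, so its share of the whole cluster is the product of the percentages along its path. The Python sketch below works this out for a made-up queue tree; the queue names and numbers are examples, not a recommended layout.

```python
def absolute_capacities(tree, parent_path="root", parent_share=100.0):
    """Turn per-level percentages into shares of the whole cluster.

    `tree` maps a queue name to (percent of its parent, child queues).
    Returns a dict from the full queue path to its share in percent.
    """
    result = {}
    for name, (percent, children) in tree.items():
        path = f"{parent_path}.{name}"
        share = parent_share * percent / 100.0
        result[path] = share
        # Children split the share of their parent, not of the cluster.
        result.update(absolute_capacities(children, path, share))
    return result

# Hypothetical queue layout used only for this example.
QUEUES = {
    "prod": (70, {"etl": (60, {}), "reports": (40, {})}),
    "dev": (30, {}),
}

if __name__ == "__main__":
    for path, share in absolute_capacities(QUEUES).items():
        print(f"{path}: {share:.1f}% of the cluster")
```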

The NodeManager is the agent on each machine that is responsible for containers, monitors their resource usage (CPU, disk, RAM and network) and reports it to the ResourceManager/Scheduler.
The ApplicationMaster of each application is responsible for negotiating suitable resource containers from the Scheduler and tracking their status.

MRv2 keeps API compatibility with the previous Hadoop-1.x versions. This means that all MapReduce jobs should run on MRv2 without any changes, with only a recompile.

Edit the configuration files.
The configuration is done in the following files:
* ~/.bashrc
* /usr/local/hadoop/etc/hadoop/hadoop-env.sh
* /usr/local/hadoop/etc/hadoop/core-site.xml
* /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
* /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Make the following change in the /usr/local/hadoop/etc/hadoop/hadoop-env.sh file:
export JAVA_HOME=${JAVA_HOME}

Configure the /usr/local/hadoop/etc/hadoop/core-site.xml file. Hadoop reads this file at startup and applies the settings in it. Create a tmp directory for Hadoop and give ownership of the new directory to the user and group created earlier:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp

Open the /usr/local/hadoop/etc/hadoop/core-site.xml file and add the following lines between the <configuration></configuration> tags:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the file system implementation. The URI scheme determines the configuration property (fs.SCHEME.impl) that names the file system implementation class. The URI authority is used to determine the host, port, etc. for the file system.</description>
</property>

The /usr/local/hadoop/etc/hadoop/mapred-site.xml file defines which framework is used when MapReduce starts. Add the following lines between the <configuration></configuration> tags:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce JobTracker runs at. If "local", jobs are run in-process as a single map and reduce task.</description>
</property>

Configure the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file. This file must be configured on every node that participates in the cluster. It specifies the directories that the namenode and the datanode will use. Before changing the file we have to create the two directories that will hold the namenode and datanode data of this Hadoop installation. Create the needed directories:
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo chown -R hduser:hadoop /usr/local/hadoop_store

Open the /usr/local/hadoop/etc/hadoop/hdfs-site.xml file and add the following lines between the <configuration></configuration> tags:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at creation time.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
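
After editing, it can be useful to check that the file is still well-formed XML and that the properties have the intended values. The Python sketch below does this with the standard library; the `read_properties` helper is hypothetical, and the embedded sample repeats only part of the configuration above so that the example runs on its own.

```python
import xml.etree.ElementTree as ET

def read_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props

# Sample content only; on a real node you would read the hdfs-site.xml file instead.
SAMPLE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/namenode</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for name, value in read_properties(SAMPLE).items():
        print(f"{name} = {value}")
```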

Format the Hadoop file system:
$ hadoop namenode -format
Note: This command must be executed before Hadoop starts. If you run the command again, remember that all the data in the file system will be destroyed.

To stop Hadoop, run the stop-all.sh script in the /usr/local/hadoop/sbin directory (or the stop-dfs.sh and stop-yarn.sh scripts). Finally, you can open http://hdfsnode1.unixmen.com:50070/ in a web browser.

Note: Our example used wordcount, but if you want to see the list of programs available in the MapReduce examples, run the following command:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar
