Apache Hadoop is an open-source framework that allows for distributed processing of large data sets across clusters of computers.

Apache Hadoop has these major projects:
Hadoop Distributed File System (HDFS): A distributed file system providing high-throughput access to large data sets.

Hadoop Common: The common utilities that support the other Hadoop modules.

Hadoop YARN: A framework for job scheduling and cluster resource management. YARN is a generic platform that can run any distributed application, and MR2 is a distributed application that runs on top of YARN.

MapReduce 2: A YARN-based system for parallel processing of large data sets.

Thanks to http://www.youtube.com/playlist?list=PL9ooVrP1hQOHrhnO86Z9m9tDi91W2d1b6

How Hadoop processes data (MapReduce; analogy: a Java servlet)
1) Hadoop provides the MapReduce framework for processing the stored big data. The important innovation of MapReduce
is the ability to take a query over a dataset, divide it, and run it in parallel over multiple nodes.

2) You can run your indexing job by sending your code to each of the dozens of servers in your cluster,
and each server operates on its own little piece of the data. The results are then delivered back to you as a unified whole:
you map the operation out to all of those servers and then you reduce the results back into a single result set (hence MapReduce).
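As a toy illustration of this map-and-reduce idea (plain Python, not the Hadoop API; the function names and sample data are made up for the sketch):

```python
# Toy illustration of MapReduce: count words across "servers".
# Each mapper works on its own chunk; the reducer merges partial results.
from collections import Counter

def map_chunk(chunk):
    # Runs independently on each server's piece of the data.
    return Counter(chunk.split())

def reduce_counts(partial_counts):
    # Merges the per-server results back into a single result set.
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

chunks = ["big data big cluster", "big data"]          # data split across nodes
result = reduce_counts(map_chunk(c) for c in chunks)   # map out, reduce back
print(result["big"])  # 3
```

In Hadoop the mappers and reducers would run as tasks on different nodes, but the shape of the computation is the same.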

How Hadoop stores files (HDFS)
1) Hadoop lets you store files bigger than what can be stored on one particular node or server.

2) When you load all of your organization’s data into Hadoop, the software busts
that data into pieces that it then spreads across your different servers. There’s no one place where
you go to talk to all of your data; Hadoop keeps track of where the data resides. And because
multiple copies are stored, data on a server that goes offline or dies can be automatically replicated
from a known good copy.

3) Hadoop is designed to run on a large number of machines that don’t share any memory or disks.

4) Each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System.
HDFS ensures data is replicated with redundancy across the cluster.
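A rough sketch of the idea described above (illustrative Python only, not how HDFS is actually implemented; the block size, replication factor, and node names are made-up assumptions):

```python
# Sketch: split a "file" into fixed-size blocks and place each block on
# several nodes, the way HDFS replicates blocks across the cluster.
BLOCK_SIZE = 4          # bytes per block here (HDFS defaults are far larger)
REPLICATION = 3         # copies of each block
nodes = ["node1", "node2", "node3", "node4"]

def place_blocks(data):
    placement = {}
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = offset // BLOCK_SIZE
        # Simple round-robin placement; real HDFS placement is rack-aware.
        replicas = [nodes[(block_id + r) % len(nodes)] for r in range(REPLICATION)]
        placement[block_id] = replicas
    return placement

print(place_blocks(b"0123456789"))  # 3 blocks, each stored on 3 nodes
```

The point of the sketch is only that no single node holds the whole file, and every block survives the loss of a node.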

Hadoop programming
Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS.
Working directly with the Java APIs can be tedious and error prone, and it restricts Hadoop to Java programmers. Hadoop offers two solutions that make Hadoop programming easier: Pig and Hive.

1) Pig is a programming language that simplifies the common tasks of working with Hadoop: loading data, expressing transformations on the data, and storing the final results.
2) Hive enables Hadoop to operate as a data warehouse. It superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax.

“Continuous Integration” is a development practice that requires developers to integrate code into a shared repository frequently. Each check-in is then verified by an automated build, allowing teams to detect problems early. Beyond automating builds, we can also automate deployment and testing.

1) A Cinder volume can be used as the boot disk for a Cloud instance; in that scenario, an ephemeral disk is not required.
2) Cinder allows block devices to be exposed and connected to compute instances for expanded storage, better performance, and integration with different storage platforms such as SolidFire.

Modify the ~/.ssh/config file of your ceph-deploy admin node so that it logs in to Ceph Nodes as the user you created (e.g., ceph).

Create a directory on your admin node for maintaining the configuration that ceph-deploy generates for your cluster. Run all admin commands from this directory. Do not use sudo for any ceph-deploy command.

mkdir my-cluster

cd my-cluster

Create Base Ceph cluster from Admin node. On your admin node from the directory you created for holding your configuration file, perform the following steps using ceph-deploy.

Create the cluster.

ceph-deploy new {initial-monitor-node(s)}, e.g. ceph-deploy new node1

** Check the output of ceph-deploy with ls and cat in the current directory. You should see a Ceph configuration file, a monitor secret keyring, and a log file for the new cluster. See ceph-deploy new -h for additional details.

** If you have more than one network interface, add the public network setting under the [global] section of your Ceph configuration file. See the Network Configuration Reference for details.

public network = {ip-address}/{netmask}

N.B. to get CIDR format of network run “ip route list”
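If you would rather compute the CIDR form yourself, Python's standard ipaddress module can do it (a small sketch; the address and netmask below are just example values):

```python
# Convert an address plus dotted netmask into CIDR notation for the
# "public network" setting.
import ipaddress

net = ipaddress.ip_network("192.168.1.0/255.255.255.0")
print(net)             # 192.168.1.0/24
print(net.prefixlen)   # 24
```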

Install Ceph.

ceph-deploy install --no-adjust-repos node1 node2 node3

Add the initial monitor(s) and gather the keys

ceph-deploy mon create-initial

** Once you complete the process, your local directory should have the following keyrings:

{cluster-name}.client.admin.keyring

{cluster-name}.bootstrap-osd.keyring

{cluster-name}.bootstrap-mds.keyring

Add two OSDs. For fast setup, this quick start uses a directory rather than an entire disk per Ceph OSD Daemon.

ssh node2

sudo mkdir /var/local/osd0

exit

ssh node3

sudo mkdir /var/local/osd1

exit

From admin node, use ceph-deploy to prepare the OSDs.

ceph-deploy osd prepare node2:/var/local/osd0 node3:/var/local/osd1

Activate the OSDs.

ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1

Use ceph-deploy to copy the configuration file and admin key to your admin node and your Ceph Nodes so that you can use the ceph CLI without having to specify the monitor address and ceph.client.admin.keyring each time you execute a command.

ceph-deploy admin node1 node2 node3 admin-node

Ensure that you have the correct permissions for the ceph.client.admin.keyring.

sudo chmod +r /etc/ceph/ceph.client.admin.keyring

Check your cluster’s health.

ceph health

**Your cluster should return an active + clean state when it has finished peering.

Create a Ceph block device using QEMU

QEMU/KVM can interact with Ceph Block Devices via librbd.

Install virtualization stack for Ceph Block storage on client node

sudo apt-get update

sudo apt-get install qemu libvirt-bin

To configure Ceph for use with libvirt, perform the following steps:

Create a pool (or use the default). The following example uses the pool name libvirt-pool with 128 placement groups.

ceph osd pool create libvirt-pool 128 128

Verify the pool exists.

ceph osd lspools

Verify that client.admin exists in the output of the “ceph auth list” command.

** libvirt will access Ceph using the ID libvirt, not the Ceph name client.libvirt. See Cephx Commandline for detailed explanation of the difference between ID and name.

Use QEMU to create a block device image in your RBD pool.

qemu-img create -f rbd rbd:libvirt-pool/new-libvirt-image 2G

Verify the image exists.

rbd -p libvirt-pool ls

** You can also use rbd create to create an image, but we recommend ensuring that QEMU is working properly.

By using the CRUSH algorithm to store and retrieve data, we can avoid a single point of failure and scale easily.

The data placement strategy in Ceph has two parts: placement groups and the CRUSH map.

Each object must belong to some placement group.

Ceph clients (RADOS / librados) and Ceph OSD Daemons both use the CRUSH algorithm to efficiently compute information about data containers on demand, instead of having to depend on a central broker.

Ceph OSD Daemons create object replicas on other Ceph Nodes to ensure data safety and high availability. This replication is synchronous: a new or updated object is guaranteed to be available on the replicas before an application is notified that the write has completed.

Pools are logical partitions for storing objects. Ceph clusters have the concept of pools, where each pool has a certain number of placement groups. Placement groups are just collections of mappings to OSDs. Each PG has a primary OSD and a number of secondary ones, based on the replication level you set when you create the pool. When an object gets written to the cluster, CRUSH determines which PG the data should be sent to. The data first hits the primary OSD and is then replicated out to the other OSDs in the same placement group. Ceph clients retrieve the latest cluster map from a Ceph Monitor and write objects to pools.

Currently, reads always come from the primary OSD in the placement group rather than a secondary, even if the secondary is closer to the client. In many cases, spreading reads out over all of the OSDs in the cluster is better than trying to optimize reads to only hit local OSDs.

NB

This could be a potential bottleneck: if a lot of clients want to read the same file, all requests will land on the same OSD while the other replica OSDs sit idle.

The only input required by the client is the object ID and the pool.

7.2 Physical placement

The CRUSH algorithm maps each object to a placement group and then maps each placement group to one or more Ceph OSD Daemons.

The Ceph client uses the CRUSH algorithm to compute where to store an object, maps the object to a pool and placement group, and then looks at the CRUSH map to identify the primary OSD for the placement group.

o Total PGs = (OSDs × 100) / Replicas {rounded up to the nearest power of 2}

o CRUSH calculates the hash of the object ID modulo the number of PGs to get a PG ID (e.g., 0x58).

o CRUSH gets the pool ID given the pool name (e.g., “liverpool” = 4)

o CRUSH prepends the pool ID to the PG ID (e.g., 4.0x58).

With a copy of the cluster map and the CRUSH algorithm, the client can compute exactly which OSD to use when reading or writing a particular object.
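The placement computation above can be sketched like this (illustrative Python, not Ceph's actual rjenkins hash; the OSD count, pool ID, and object name are made-up example values):

```python
# Sketch of the object -> PG mapping described above.
import zlib

def total_pgs(num_osds, replicas):
    # (OSDs * 100) / replicas, rounded up to the nearest power of 2.
    n = (num_osds * 100) // replicas
    pgs = 1
    while pgs < n:
        pgs *= 2
    return pgs

def pg_id(object_id, pool_id, pg_num):
    # Hash the object ID modulo the PG count, then prepend the pool ID.
    h = zlib.crc32(object_id.encode())   # stand-in for Ceph's real hash
    return f"{pool_id}.{h % pg_num:x}"

pg_num = total_pgs(num_osds=9, replicas=3)           # 300 rounds up to 512
print(pg_num)                                        # 512
print(pg_id("my-object", pool_id=4, pg_num=pg_num))  # something like "4.58"
```

Because every client can run this computation itself from the cluster map, no central lookup service is needed.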

Replication is always executed at the PG level: All objects of a placement group are replicated between different OSDs in the RADOS cluster.

An object ID is unique across the entire cluster, not just the local file system.

The client writes the object to the identified placement group in the primary OSD. Then, the primary OSD with its own copy of the CRUSH map identifies the secondary and tertiary OSDs for replication purposes, and replicates the object to the appropriate placement groups in the secondary and tertiary OSDs (as many OSDs as additional replicas), and responds to the client once it has confirmed the object was stored successfully.

Pools: A pool differs from CRUSH’s location-based buckets in that a pool doesn’t have a single physical location, and a pool provides you with some additional functionality, including replicas, placement groups, CRUSH rules, snapshots, and setting ownership.

So should I have 1 SSD per storage node for journaling?

Ans: Not necessarily. It depends on a number of factors. In some cases that may be sufficient, and in others the SSD can become a bottleneck and rapidly wear out. Different applications will have a different ideal ratio of SSD journals to spinning disks, taking into account the rate of write IO and the bandwidth requirements of the node.

What happens in the case of a big file (for example, 100 MB) with multiple chunks? Is Ceph smart enough to read multiple chunks from multiple servers simultaneously, or will the whole file be served by just one OSD?

Ans: RADOS is the underlying storage cluster, but the access methods (block, object, and file) stripe their data across many RADOS objects, which CRUSH very effectively distributes across all of the servers. A 100MB read or write turns into dozens of parallel operations to servers all over the cluster.
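The striping described in the answer can be sketched as follows (illustrative Python; the 4 MB object size mirrors a common RADOS default, but treat the numbers as assumptions):

```python
# Sketch: a 100 MB file striped into fixed-size RADOS objects, each of
# which CRUSH would place independently, so reads/writes parallelize.
OBJECT_SIZE = 4 * 1024 * 1024   # 4 MB per RADOS object (assumed default)

def stripe(file_size):
    objects = []
    offset = 0
    while offset < file_size:
        length = min(OBJECT_SIZE, file_size - offset)
        objects.append((f"obj.{len(objects):04d}", offset, length))
        offset += length
    return objects

objs = stripe(100 * 1024 * 1024)
print(len(objs))   # 25 objects, spread across the cluster by CRUSH
```

Since each of those objects maps to its own PG and primary OSD, a single 100 MB read fans out across many servers at once.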

The problem with reading from random/multiple replicas by default is cache efficiency. If every reader picks a random replica, then there are effectively N locations that may have an object cached in RAM (instead of on disk), and the caches for each OSD will be about 1/Nth as effective. The only time it makes sense to read from replicas is when you are CPU or network limited; the rest of the time it is better to read from the primary’s cache than a replica’s disk.

How does a write get performed?

OSDs use a write-ahead mode for local operations: a write hits the journal first and is then copied into the backing filestore.

This, in turn, leads to a common design principle for Ceph clusters that are both fast and cost-effective: create one OSD per spinning disk in the system. Many contemporary systems come with only two SSD slots and then as many spinners as you want. That is not a problem for journal capacity: a single OSD’s journal is usually no larger than about 6 GB, so even for a 16-spinner system (approx. 96 GB of journal space) appropriate SSDs are available at reasonable expense.
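The journal-capacity arithmetic above, as a quick check (the per-OSD journal size and spinner count are the figures quoted in the text):

```python
# Journal space needed on a node, per the figures in the paragraph above.
JOURNAL_PER_OSD_GB = 6   # a single OSD's journal, approx.
spinners = 16            # one OSD per spinning disk
print(spinners * JOURNAL_PER_OSD_GB)  # 96 GB of SSD journal space
```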

A buzz sentence for me these days is “How to compile gcc on Solaris 10 for the x86 platform”. There are a lot of problems and missing links in the latest source code available for gcc. Use the following steps to get gcc compiled successfully on S10 for x86.
1) Download the gcc source code using any tool available to you; in my case I used svn to download the source tree of gcc.
svn checkout http://gcc.gnu.org/svn/gcc/trunk gcc
This will create a gcc directory with all source files in it.
2) Download latest binutils from http://ftp.gnu.org/gnu/binutils/ to compile loader and other tools.
3) Optionally, create a separate directory by linking all the source files for running gmake.
4) Now first compile the binutils directory using the following command:
./configure --prefix= --with-gmp= --with-mpfr=

** Here the --with-gmp and --with-mpfr paths need to be set before running the configure command. So if you do not have the gmp and mpfr libraries, first download and compile them.
The GNU MP Bignum Library
The MPFR Library

5) There can be cases where the gcc loader cannot load some libraries; then we need to copy that particular .so file into $BuildDir/$BuildDir/lib/. I think there is a bug with the binutils loader that prevents it from loading the same library from /usr/bin or /lib/, which is why the above workaround is needed.