Saturday, 1 February 2014

A bird's-eye view of Hadoop's architecture

A bird's-eye view of Hadoop's architecture, as described by Mike Olson, CEO of Cloudera:

Hadoop is designed to run on a large number of machines that don't share any memory or disks.

That means you can buy a whole bunch of commodity servers, slap them in a rack, and run the Hadoop software on each one.

When you want to load all of your organization's data into Hadoop, the
software busts that data into pieces that it then spreads across your
different servers. There's no one place where you go to talk to all of
your data; Hadoop keeps track of where each piece resides. And because
multiple copies of each piece are stored, data on a server that goes
offline or dies can be automatically replicated from a known good copy.
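The splitting and replication described above can be sketched in a few lines. This is a conceptual illustration, not Hadoop's actual code: the block size, replication factor, and server names are made-up values (real HDFS uses 128 MB blocks and a NameNode to do this bookkeeping).

```python
BLOCK_SIZE = 16          # bytes per block for illustration (HDFS defaults to 128 MB)
REPLICATION = 3          # copies kept of each block
SERVERS = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Bust the data into fixed-size pieces."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, servers, replication: int = REPLICATION):
    """Assign each block to `replication` different servers, round-robin.

    The returned map is the 'where does my data live' bookkeeping that
    lets any copy stand in for a dead server's copy."""
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [servers[(i + r) % len(servers)] for r in range(replication)]
    return placement

data = b"all of your organization's data, loaded into the cluster"
blocks = split_into_blocks(data)
placement = place_blocks(blocks, SERVERS)
for block_id, nodes in placement.items():
    print(f"block {block_id} -> {nodes}")
```

If `node2` dies, every block it held still exists on two other servers, so the cluster can quietly make a fresh third copy elsewhere.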

In a centralized database system, you have got one big disk connected to
four or eight or 16 big processors. But that is as much horsepower as
you can bring to bear. In a Hadoop cluster, every one of those servers
has two or four or eight CPU cores.

You can run your indexing job by sending your code to each of the dozens
of servers in your cluster, and each server operates on its own little
piece of the data. Results are then delivered back to you in a unified
whole. That's MapReduce: you map the operation out to all of those
servers and then you reduce the results back into a single result set.
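That map-then-reduce flow can be illustrated with a toy word count, the classic MapReduce example. This is a single-process sketch of the idea, not the Hadoop API: the "chunks" list stands in for the pieces of data sitting on different servers.

```python
from collections import defaultdict

def map_phase(chunk: str):
    """Each server runs this on its own little piece of the data,
    emitting (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Pull all the mapped results back into a single result set."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# The "cluster": three servers, each holding one piece of the data.
chunks = ["the quick brown fox", "jumps over the lazy dog", "the end"]

intermediate = []
for chunk in chunks:          # in a real cluster these run in parallel
    intermediate.extend(map_phase(chunk))

word_counts = reduce_phase(intermediate)
print(word_counts["the"])     # the word appears once in each chunk
```

The point is that `map_phase` never needs to see anyone else's chunk, which is why the work can be sent out to dozens of servers at once.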

Architecturally, the reason you're able to deal with lots of data is
that Hadoop spreads it out. And the reason you're able to ask
complicated computational questions is that you have all of these
processors working in parallel, harnessed together.