This blog will cover my attendance at the QCon 2011 conference in San Francisco, California, from November 16 to 18. QCon is a software development conference covering many useful topics, which I will do my best to summarize in my posts. Please feel free to post your comments and get some conversation started. Enjoy!

Essentially, they followed these steps to validate their solution:

Defined POC use cases

Did a solution comparison based on their technology survey

Prototyped

Benchmarked

AutoSupport: Hadoop Use Case in POC

With a single 10-node Hadoop cluster (on E-Series with 60 disks of 2TB each), they cut the load time for a 24-billion-record query from 4 weeks down to 10.5 hours, and a previously impossible 240-billion-record query now runs in only 18 hours!
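To put those numbers in perspective, here is a rough back-of-the-envelope throughput calculation. The arithmetic is my own, derived only from the record counts and load times quoted in the talk:

```python
# Implied load throughput from the quoted figures (my own arithmetic,
# not numbers given in the talk itself).
records_24b = 24_000_000_000
hours_24b = 10.5
throughput_24b = records_24b / (hours_24b * 3600)  # records per second

records_240b = 240_000_000_000
hours_240b = 18
throughput_240b = records_240b / (hours_240b * 3600)

print(f"24B-record load:  ~{throughput_24b:,.0f} records/sec")   # ~635K/sec
print(f"240B-record load: ~{throughput_240b:,.0f} records/sec")  # ~3.7M/sec
```

Roughly 635 thousand records per second for the smaller load and about 3.7 million per second for the larger one, which gives a feel for the scale a modest 10-node cluster can reach.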

All their machines run Red Hat Enterprise Linux (RHEL) 5.6 with the ext3 filesystem.

Hadoop Architecture Components

Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Hadoop’s HDFS.
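A Flume deployment is wired together from sources, channels, and sinks. As a minimal sketch (the agent name, source command, and HDFS path below are my own illustrative placeholders, written in the Flume NG properties format), an agent that tails an application log into HDFS might be configured like this:

```properties
# Illustrative Flume agent config: tail a log file into HDFS.
# Names and paths are placeholders, not from the talk.
agent.sources = tail1
agent.channels = mem1
agent.sinks = hdfs1

# Source: follow an application log
agent.sources.tail1.type = exec
agent.sources.tail1.command = tail -F /var/log/app/app.log
agent.sources.tail1.channels = mem1

# Channel: buffer events in memory between source and sink
agent.channels.mem1.type = memory

# Sink: write the events into HDFS
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/logs
agent.sinks.hdfs1.channel = mem1
```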

HDFS: Hadoop Distributed File System, is the primary storage system used by Hadoop applications.

HBase: the Hadoop database, to be used when you need random, real-time read/write access to your Big Data.

MapReduce: a software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.
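The programming model itself is easy to sketch outside of Hadoop: a map function emits key/value pairs, the framework shuffles and groups values by key, and a reduce function combines each group. A minimal pure-Python simulation of the classic word-count job (my own illustrative code, not Hadoop API calls):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle/sort: group values by key (the framework does this in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine the grouped values for each key."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # -> 3
print(counts["fox"])  # -> 2
```

On a real cluster the map and reduce functions run in parallel across many machines, with the shuffle step moving data between them; the three functions above just make the division of labor concrete.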

Pig: from Apache, it is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.