
12.
Hadoop System Principles
• Scale-Out rather than Scale-Up
• Bring code to data rather than data to code
• Deal with failures – they are common
• Abstract complexity of distributed and concurrent applications

13.
Scale-Out Instead of Scale-Up
• It is harder and more expensive to scale up
– Add additional resources to an existing node (CPU, RAM)
– Moore’s Law can’t keep up with data growth
– New units must be purchased if the required resources cannot be added
– Also known as scaling vertically
• Scale-Out
– Add more nodes/machines to an existing distributed application
– The software layer is designed for node addition and removal
– Hadoop takes this approach: a set of nodes is bonded together as a single distributed system
– Very easy to scale down as well

16.
Failures are Common
• Given a large number of machines, failures are common
– Large warehouses may see machine failures weekly or even daily
• Hadoop is designed to cope with node failures (see the sketch below)
– Data is replicated
– Tasks are retried
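Both mechanisms are visible from the client API. Here is a minimal Java sketch of how they could be tuned per job; the property names (dfs.replication, mapreduce.map.maxattempts) are the Hadoop 2.x names, and the values shown are illustrative rather than recommended:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class FaultToleranceConfig {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Keep three copies of every HDFS block, so losing a node
            // (or two) holding a block does not lose the data.
            conf.setInt("dfs.replication", 3);

            // Allow a failed map or reduce task to be retried on another
            // node before the whole job is declared failed.
            conf.setInt("mapreduce.map.maxattempts", 4);
            conf.setInt("mapreduce.reduce.maxattempts", 4);

            Job job = Job.getInstance(conf, "fault-tolerant job");
            // ... mapper, reducer, and input/output paths would be set here
        }
    }

With three replicas, HDFS can serve a block even if two of the nodes holding it fail; with four attempts, a transient node failure merely reschedules the task elsewhere.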

18.
History of Hadoop
• Started as a sub-project of Apache Nutch
– Nutch’s job is to index the web and expose it for searching
– An open-source alternative to Google’s search engine
– Started by Doug Cutting
• In 2004 Google published its Google File System (GFS) and MapReduce framework papers
• Doug Cutting and the Nutch team implemented Google’s frameworks in Nutch
• In 2006 Yahoo! hired Doug Cutting to work on Hadoop with a dedicated team
• In 2008 Hadoop became an Apache Top Level Project
– http://hadoop.apache.org

19.
Naming Conventions?
• Doug Cutting drew inspiration from his family
– Lucene: Doug’s wife’s middle name
– Nutch: a word for "meal" that his son used as a toddler
– Hadoop: a yellow stuffed elephant named by his son

21.
Comparisons to RDBMS (Continued)
• Structured Relational vs. Semi-Structured vs. Unstructured
– RDBMS works well for structured data: tables that conform to a predefined schema
– Hadoop works best on semi-structured and unstructured data
• Semi-structured data may have a schema that is loosely followed
• Unstructured data has no structure whatsoever and is usually just blocks of text (or, for example, images)
• At processing time, the types for keys and values are chosen by the implementer (see the mapper sketch below)
– Certain types of input data will not easily fit into a relational schema, such as images, JSON, XML, etc.
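As a concrete illustration of choosing key and value types at processing time, here is a minimal mapper sketch (the class name WordMapper and the whitespace tokenization are illustrative choices, not from the slides). The input is schema-less lines of text; the implementer decides, in code, that keys are words (Text) and values are counts (IntWritable):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // The implementer picks the output types: Text keys, IntWritable values.
    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // No predefined schema: structure is imposed here, at read time.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }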

22.
Comparison to RDBMS
• Offline batch vs. online transactions
– Hadoop was not designed for real-time or low-latency queries
– Products that do provide low-latency queries, such as HBase, have limited query functionality
– Hadoop performs best for offline batch processing on large amounts of data
– RDBMS is best for online transactions and low-latency queries
– Hadoop is designed to stream large files and large amounts of data
– RDBMS works best with small records

23.
Comparison to RDBMS
• Hadoop and RDBMS frequently complement each other within an architecture
• For example, a website that
– has a small number of users
– produces a large amount of audit logs
[Diagram: Web Server ↔ RDBMS ↔ Hadoop, with four numbered data flows]
1. Utilize the RDBMS to provide a rich user interface and enforce data integrity
2. The RDBMS generates large amounts of audit logs; the logs are moved periodically to the Hadoop cluster (see the sketch below)
3. All logs are kept in Hadoop; various analytics are executed periodically
4. Results are copied to the RDBMS to be used by the Web Server; for example, "suggestions" based on audit history

25.
Hadoop Eco System
• To start building an application, you need a file system
– In the Hadoop world that would be the Hadoop Distributed File System (HDFS)
– In Linux it could be ext3 or ext4
• The addition of a data store provides a nicer interface to store and manage your data
– HBase: a key-value store implemented on top of HDFS (see the client sketch below)
– Traditionally one could use an RDBMS on top of a local file system
[Stack diagram: HBase on top of the Hadoop Distributed File System (HDFS)]
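To make the key-value characterization concrete, here is a minimal sketch against the HBase Java client (a more recent client API than these slides likely used; the "users" table, "info" column family, and values are assumptions, and the table is assumed to already exist):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseKeyValueDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Store a value under a row key: HBase acts as a key-value store
                Put put = new Put(Bytes.toBytes("user-42"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);

                // Read it back by key
                Result result = table.get(new Get(Bytes.toBytes("user-42")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }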

26.
Hadoop Eco System
• For batch processing, you will need to utilize a framework
– In Hadoop’s world that would be MapReduce
– MapReduce eases the implementation of distributed applications that run on a cluster of commodity hardware (see the job sketch below)
[Stack diagram: MapReduce and HBase on top of the Hadoop Distributed File System (HDFS)]
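As a sketch of what a MapReduce application looks like, here is the classic word-count job driver and reducer, wired to the WordMapper sketched earlier (class names are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Sums the 1s emitted by WordMapper for each word
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                context.write(word, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordMapper.class);   // mapper from the earlier sketch
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework handles distribution, scheduling, and retries; the implementer only writes the map and reduce functions.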

29.
Hadoop Eco System
• Your organization may have a good number of SQL experts
– The addition of Apache Hive, a data warehouse solution that provides a SQL-based interface, may bridge the gap (see the JDBC sketch below)
[Stack diagram: Oozie, Pig, and Hive on top of MapReduce; MapReduce and HBase on top of the Hadoop Distributed File System (HDFS)]
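As a minimal sketch of how those SQL experts might use it: Hive exposes a JDBC interface through HiveServer2, so an ordinary SQL query can be submitted from Java. The host, port, and the logs table below are assumptions; behind the scenes Hive compiles the query into MapReduce jobs:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver (requires the Hive JDBC jar)
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Hypothetical HiveServer2 address
            String url = "jdbc:hive2://hiveserver:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "", "");
                 Statement stmt = conn.createStatement();
                 // Plain SQL over data stored in HDFS
                 ResultSet rs = stmt.executeQuery(
                     "SELECT user_id, COUNT(*) AS hits FROM logs GROUP BY user_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }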

30.
Hadoop Distributions
• Let’s say you download Hadoop’s HDFS and MapReduce from http://hadoop.apache.org/
• At first it works great, but then you decide to start using HBase
– No problem, just download HBase from http://hadoop.apache.org/ and point it to your existing HDFS installation
– But you find that HBase can only work with a previous version of HDFS, so you downgrade HDFS and everything still works great
• Later on you decide to add Pig
– Unfortunately your version of Pig doesn’t work with your version of HDFS; it wants you to upgrade
– But if you upgrade you’ll break HBase...

31.
Hadoop Distributions
• Hadoop distributions aim to resolve version incompatibilities
• A distribution vendor will
– Integration-test a set of Hadoop products
– Package Hadoop products in various installation formats
• Linux packages, tarballs, etc.
– Distributions may provide additional scripts to execute Hadoop
– Some vendors may choose to backport features and bug fixes made by Apache
– Typically vendors will employ Hadoop committers, so the bugs they find will make it into Apache’s repository

33.
Cloudera Distribution for Hadoop (CDH)
• Cloudera has taken the lead on providing a Hadoop distribution
– Cloudera is affecting the Hadoop ecosystem in the same way Red Hat popularized Linux in enterprise circles
• The most popular distribution
– http://www.cloudera.com/hadoop
– 100% open source
• Cloudera employs a large percentage of core Hadoop committers
• CDH is provided in various formats
– Linux packages, virtual machine images, and tarballs

42.
Summary
• We learned that
– Data storage needs are rapidly increasing
– Hadoop has become the de facto standard for handling these massive data sets
– The Cloudera Distribution for Hadoop (CDH) is the most commonly used Hadoop distribution
– There are a number of Hadoop-related publications available