Cloudera’s Distribution including Apache Hadoop (CDH)
• A single, easy-to-install package from the Apache Hadoop core repository
• Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
• 100% open source

HP Vertica was the first analytic database company to deliver a Hadoop Connector. HP Vertica now offers two connectors to transfer data seamlessly between Hadoop and HP Vertica:

The Hadoop Distributed File System (HDFS) connector enables you to load data from HDFS using the HP Vertica native COPY facility. This mechanism simplifies and accelerates the process of loading data stored in HDFS without any MapReduce coding. The connector also ensures that data is loaded from the Hadoop cluster with the optimal amount of parallelism. By using the connector with the HP Vertica External Tables feature, you can even query data in HDFS without copying data into HP Vertica.

The Hadoop & Pig Connector is bidirectional and enables you to move data from Hadoop to HP Vertica or vice versa via either MapReduce or Pig jobs.

With HP Vertica HDFS and Pig Connectors, you have unprecedented flexibility and speed in loading data from HDFS to the HP Vertica Analytics Platform and querying data from the HP Vertica Analytics Platform in Hadoop. The HP Vertica HDFS and Pig Connectors are open source, supported by HP Vertica, and available for download.
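As a rough illustration of the HDFS connector's COPY path described above, the sketch below builds the SQL statements as strings in Python. The `SOURCE Hdfs(url=..., username=...)` form follows the pattern documented for the connector in HP Vertica 6.x/7.x and should be checked against the installed version. The helper names, table name, columns, host, and user are all hypothetical.

```python
def sql_literal(value):
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def hdfs_source(webhdfs_url, username):
    """Render the connector's source clause (form assumed from the 6.x/7.x docs)."""
    return "SOURCE Hdfs(url={}, username={})".format(
        sql_literal(webhdfs_url), sql_literal(username)
    )


def copy_from_hdfs(table, webhdfs_url, username):
    """COPY statement that loads HDFS files into an existing table."""
    return "COPY {} {};".format(table, hdfs_source(webhdfs_url, username))


def external_table_over_hdfs(table, columns, webhdfs_url, username):
    """External table definition that queries HDFS data without loading it."""
    cols = ", ".join("{} {}".format(name, sql_type) for name, sql_type in columns)
    return "CREATE EXTERNAL TABLE {} ({}) AS COPY {};".format(
        table, cols, hdfs_source(webhdfs_url, username)
    )


if __name__ == "__main__":
    # Hypothetical WebHDFS endpoint and Hadoop user.
    url = "http://namenode.example:50070/webhdfs/v1/logs/*"
    print(copy_from_hdfs("web_logs", url, "hadoopUser"))
    print(external_table_over_hdfs(
        "web_logs_ext", [("ip", "VARCHAR(40)"), ("hits", "INTEGER")], url, "hadoopUser"
    ))
```

The first statement copies the files into HP Vertica. The second leaves the data in HDFS and reads it at query time, which corresponds to the External Tables usage mentioned above.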

The largest company using Hadoop is probably Yahoo! or Facebook. On February 19, 2008, Yahoo! Inc. launched what it claimed was the world’s largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produces data that is used in every Yahoo! Web search query. In June 2012, Facebook claimed to have the largest Hadoop cluster in the world, with 100 PB of storage.

Other companies that use Hadoop include Amazon, Ancestry, American Airlines, Akamai, Apple, AVG, Google, IBM, Twitter, Spotify, Electronic Arts, the New York Times, Nokia, Trend Micro, Navteq, Sears, Activision, Aggregate Knowledge, Skybox Imaging, AOL, Gravity Interactive, JiWire, Samsung, Explorys, CBS Interactive, eBay, DataXu, Qualcomm, AppNexus, Opower, AdGooroo, Mobile Posse, Concurrent, Tynt, Klout, PulsePoint, Huron Consulting Group, Rapleaf, Trulia, SRA, Apollo Group Inc., Groupon, Shopzilla, and Adconion.

There are many alternatives to Hadoop, but they are far behind; Hadoop is the undisputed leader in Big Data. The most promising alternative to Hadoop is Spark.

Spark (http://www.spark-project.org/) is another open source system, developed at the UC Berkeley AMP Lab. Users include UC Berkeley, Conviva, Klout, and Quantifind, among others.

Spark is claimed to run up to 100 times faster than Hadoop in scenarios such as iterative algorithms and interactive data mining. It is also used for general data processing, and it is the engine behind Shark, a fully Apache Hive-compatible data warehousing system that can run up to 100 times faster than Hive. The figure below shows an advertised comparison of logistic regression performance on Hadoop MapReduce and Spark.

Spark benefits from processing data in memory, whereas Hadoop MapReduce is disk-based. Spark can cache datasets in memory to speed up reuse.
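To make the benefit of caching concrete, here is a minimal sketch in plain Python. It does not use the Spark or Hadoop APIs. It trains a one-feature logistic regression by gradient descent and counts how often the full dataset has to be read. `CountingSource` is a hypothetical stand-in for data on disk, and the synthetic data, learning rate, and iteration count are assumptions chosen only for illustration.

```python
import math
import random


class CountingSource:
    """Stands in for a dataset on disk and counts full reads."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._rows)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_step(rows, w, b, lr):
    """One pass over all rows: a single gradient descent update."""
    gw = gb = 0.0
    for x, y in rows:
        err = sigmoid(w * x + b) - y
        gw += err * x
        gb += err
    n = len(rows)
    return w - lr * gw / n, b - lr * gb / n


def train(source, iterations, cache):
    """Train with the dataset either cached once or re-read every iteration."""
    cached = source.read() if cache else None
    w = b = 0.0
    for _ in range(iterations):
        rows = cached if cache else source.read()
        w, b = gradient_step(rows, w, b, lr=0.5)
    return w, b


def make_rows(n, seed=0):
    """Synthetic data: label is 1.0 when the feature is positive, else 0.0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(-3, 3)
        rows.append((x, 1.0 if x > 0 else 0.0))
    return rows


if __name__ == "__main__":
    rows = make_rows(200)
    reread, cached = CountingSource(rows), CountingSource(rows)
    train(reread, 20, cache=False)
    train(cached, 20, cache=True)
    print("full reads without cache:", reread.reads)
    print("full reads with cache:", cached.reads)
```

Both runs produce the same model, but the uncached run reads the whole dataset once per iteration. That pattern is why iterative algorithms are the scenario where in-memory reuse matters most.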

No single framework fits every problem, so it is wise to evaluate other distributed frameworks, such as Spark, that may be better suited to specialized scenarios. Compatibility with Hadoop is a plus.

We use Big Data especially for unstructured data. An example of unstructured data is a collection of customers' social media posts gathered to analyze customer behavior. About 80% of the data captured today is unstructured: readings from sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. All of this unstructured data is Big Data.

Organizations are discovering that important predictions can be made by sorting through and analyzing Big Data. However, since 80% of this data is “unstructured”, it must first be formatted (or structured) in a way that makes it suitable for data mining and subsequent analysis. Hadoop is the most popular Big Data package. It is the core platform for structuring Big Data, and it solves the problem of making that data useful for analytics.
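As a small illustration of what "structuring" can mean, the sketch below turns raw social media posts into a table of hashtag counts using map, shuffle, and reduce phases. It is plain Python that only mimics the MapReduce pattern on one machine. It does not use the Hadoop API, and the sample posts and function names are hypothetical.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")


def map_post(post):
    """Map phase: emit (hashtag, 1) for each hashtag in one raw post."""
    for tag in HASHTAG.findall(post.lower()):
        yield tag, 1


def shuffle(pairs):
    """Shuffle phase: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: collapse one key's values into a single total."""
    return key, sum(values)


def hashtag_counts(posts):
    """Run the three phases over an iterable of unstructured posts."""
    pairs = (pair for post in posts for pair in map_post(post))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    posts = [
        "Loving the new phone! #Gadget #happy",
        "Battery died again... #gadget",
        "no tags here",
    ]
    print(hashtag_counts(posts))
```

The output is a structured (hashtag, count) table that an analytics tool can query directly. On a real cluster the map and reduce phases would run in parallel across many machines.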