I downloaded the Hortonworks sandbox today. I’m using the version that runs as a virtual machine under Oracle VirtualBox. The sandbox can run in as little as 2GB RAM, but requires 4GB in order to enable Ambari and HBase. Good thing that I have 8GB in my laptop.

The “Hello World” tutorial gave me hands-on experience with:

Uploading a file into HCatalog

Typing queries into Beeswax, which is a GUI into Hive

Running a more complex query by writing a short script in Pig

There are a lot more tutorials. I’ll update this blog post after I finish each tutorial.

Hive offers JDBC and ODBC drivers that you can use to interface with your traditional systems. However, access through these drivers is not high performance.

Hive was originally built by (and is still used by) Facebook to bring traditional database concepts into Hadoop in order to perform analytics. Netflix also uses it to run daily summaries.

Pig is sometimes compared to Hive, in that they are both “languages” layered on top of Hadoop. However, Pig is closer to a procedural language for writing applications, while Hive is targeted at traditional DB programmers moving over to Hadoop.
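To make the procedural versus declarative contrast concrete, here is a rough sketch in plain Python. It is neither Pig Latin nor HiveQL: the step-by-step function only imitates how a Pig script names each intermediate relation, and an in-memory SQLite query stands in for a Hive query. The sample data, column names, and function names are all made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (visitor, page, seconds_on_page)
VISITS = [
    ("alice", "home", 12),
    ("bob", "home", 7),
    ("alice", "cart", 30),
    ("carol", "cart", 3),
    ("bob", "cart", 25),
]


def procedural_style(rows, min_seconds=5):
    """Pig-like: spell out each intermediate step of the dataflow."""
    # Step 1: keep only visits that lasted long enough (similar to a FILTER).
    filtered = [row for row in rows if row[2] >= min_seconds]
    # Step 2: group the durations by page (similar to a GROUP).
    grouped = defaultdict(list)
    for _visitor, page, seconds in filtered:
        grouped[page].append(seconds)
    # Step 3: total each group (similar to a FOREACH with SUM).
    totals = {page: sum(durations) for page, durations in grouped.items()}
    return dict(sorted(totals.items()))


def declarative_style(rows, min_seconds=5):
    """Hive-like: describe the desired result in a single SQL statement."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE visits (visitor TEXT, page TEXT, seconds INTEGER)"
        )
        conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT page, SUM(seconds) FROM visits "
            "WHERE seconds >= ? GROUP BY page ORDER BY page",
            (min_seconds,),
        )
        return dict(cursor.fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    print(procedural_style(VISITS))
    print(declarative_style(VISITS))
```

Both functions compute the same totals per page. The first reads like a small program, the second like a question put to a database, which is roughly the difference in audience described above.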

Hortonworks / Apache Tez provides an alternative to MapReduce for processing near-real-time jobs at petabyte scale. The Hortonworks Stinger project uses Tez to increase the speed of Hive and Pig by one or more orders of magnitude.

Tez is based on a multi-stage dataflow architecture (pre-processor, sampler, partition, aggregate), in contrast to the traditional Map and Reduce stages.
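The multi-stage idea can be sketched as a small graph of stages, each consuming the output of the stages upstream of it. This is only a toy illustration in plain Python and does not use the Tez API. The stage names follow the list above, but what each stage does here (parsing integers, picking a median as a split point, and so on) is an assumption made for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def preprocess(raw):
    # Clean the input: drop blank entries and parse the rest as integers.
    return [int(item) for item in raw if item.strip()]


def sample(values):
    # Deterministic stand-in for sampling: use the median as a split point.
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def partition(values, split):
    # Route each value to a bucket using the split point from the sampler.
    return {
        "low": [v for v in values if v < split],
        "high": [v for v in values if v >= split],
    }


def aggregate(buckets):
    # Reduce each bucket to a single total.
    return {name: sum(values) for name, values in buckets.items()}


# Each stage lists the upstream stages whose output it consumes.
DAG = {
    "preprocess": [],
    "sample": ["preprocess"],
    "partition": ["preprocess", "sample"],
    "aggregate": ["partition"],
}

STAGES = {
    "preprocess": preprocess,
    "sample": sample,
    "partition": partition,
    "aggregate": aggregate,
}


def run_dag(raw_input):
    """Run every stage once, in an order that respects the dependencies."""
    results = {}
    for name in TopologicalSorter(DAG).static_order():
        upstream = [results[dep] for dep in DAG[name]]
        args = upstream if upstream else [raw_input]
        results[name] = STAGES[name](*args)
    return results["aggregate"]


if __name__ == "__main__":
    print(run_dag(["5", "1", " ", "9", "3"]))
```

The point of the sketch is the shape: the partition stage takes input from two earlier stages, something a single Map step followed by a single Reduce step cannot express without chaining separate jobs.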

Tez assumes the use of YARN for resource acquisition, so it cannot be run in legacy environments. It also assumes complex user-defined logic that eliminates duplicate work in order to increase performance. Legacy Hadoop assumes duplicate work, which is made less painful by the massive scale of the cluster and which brings the benefit of redundancy.

Tez may also run multiple instances within a single YARN container, which reduces the overhead of additional containers. However, this may reduce resource utilization at very large scale: using many YARN containers helps to allocate every last available hardware resource, whereas Tez squeezes as much as possible into fewer containers.

Cassandra – NoSQL database to use in conjunction with Hadoop

Posted on November 3, 2013

Some use cases feed data directly into Hadoop from their source (such as web server logs), but others feed into Hadoop from a database repository. Still others have use cases in which there is a massive output of data that needs to be stored somewhere for post-processing. One model for handling this dataset is a NoSQL database, as opposed to SQL or flat files.

Cassandra is an Apache project that is popular for its integration into the Hadoop ecosystem. It can be used with components such as Pig, Hive, and Oozie. Cassandra is often used as a replacement for HDFS and HBase because it has no master node, which eliminates a single point of failure (and the need for traditional redundancy). In theory, its scalability is linear: doubling the number of nodes doubles the number of transactions that can be processed per second. It also supports triggers; if monitoring detects that triggers are running slowly, then additional nodes can be programmatically deployed to address production performance problems.
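One way a cluster can do without a master is for every node to work out, from a shared hash ring, which node owns a given key. Cassandra's partitioning builds on this consistent hashing idea, although the real thing (partitioners, virtual nodes, replication) is considerably more involved. The sketch below is a toy version in plain Python, not Cassandra code; the class, the function names, and the node names are hypothetical, and MD5 is used only as a convenient hash for the illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 chosen for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: each node owns the keys that hash just before it."""

    def __init__(self, nodes):
        # Sort nodes by their ring position; any node can rebuild this list.
        self._tokens = sorted((token(node), node) for node in nodes)

    def owner(self, key: str) -> str:
        """Return the first node clockwise from the key, wrapping around."""
        positions = [position for position, _ in self._tokens]
        index = bisect.bisect_right(positions, token(key)) % len(self._tokens)
        return self._tokens[index][1]


def keys_per_node(ring, keys):
    """Count how many of the given keys each node owns."""
    counts = {}
    for key in keys:
        node = ring.owner(key)
        counts[node] = counts.get(node, 0) + 1
    return counts


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(1000)]
    small = Ring(["node-a", "node-b", "node-c"])
    large = Ring(["node-a", "node-b", "node-c", "node-d"])
    print(keys_per_node(small, keys))
    print(keys_per_node(large, keys))
    moved = sum(1 for k in keys if small.owner(k) != large.owner(k))
    print(f"{moved} of {len(keys)} keys moved after adding node-d")
```

No coordinator appears anywhere in the lookup: any node holding the same list of members reaches the same answer. Adding a node only takes over keys from its neighbor on the ring, which is part of why capacity can grow by adding nodes. The linear throughput figure itself is the theoretical claim above and is not something this sketch demonstrates.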

Cassandra was first developed by Facebook. The primary benefit of its easily distributed infrastructure is the ability to handle large volumes of reads and writes. The newest version (2.0) solves many of the usability problems encountered by programmers.