Mapreduce, Hadoop and R - Page 2

Any of the wide-column or even key-value stores, particularly HBase and Hypertable, can integrate with Hadoop, as you can see in the list. However, you can also see that a number of tools do not use Hadoop and rely on other storage methods instead.

If you don't want to use one of the NoSQL databases and mess with the Hadoop integration, there are two tools that the Hadoop community has designed to work with Hadoop to give it some search capability. The first, called Hive, is a data warehouse package that also has some querying features. More precisely, it can perform data summarization and some ad-hoc queries, along with larger-scale analysis of data sets stored in Hadoop. The queries are written in an SQL-like language called HiveQL, which lets you perform basic searches on the data and also plug your own map and reduce code into a query (see subsequent discussion about MapReduce).

The second tool is Pig. Pig goes beyond just querying the data and adds analysis capabilities (see subsequent section on analytics using R). Like Hadoop, Pig was designed for parallelism, which makes it a good fit for Hadoop. Pig has a high-level language called Pig Latin, which allows you to write data analysis code for accessing data that resides in Hadoop. This language is compiled to produce a series of MapReduce programs that run on Hadoop (see subsequent section on MapReduce).
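To give a feel for the shape of the MapReduce programs that such a compilation step produces, here is a rough single-process sketch in Python. It is not Pig or Hadoop code, and the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the map, group-by-key, and reduce steps that a real job would spread across many datanodes.

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs; here, one (word, 1) pair per word."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group values by key, standing in for the step between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Combine the values for each key; here, sum the counts."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(records):
    """Run the three steps in sequence on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(records)))
```

Calling word_count on a short list of strings returns a dictionary of word counts. In Hadoop, each phase would run in parallel on the nodes that hold the data.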

Hadoop is one of the technologies people are exploring for enabling Big Data. If you search Google for Hadoop architectures, you will find a number of links, but the breadth of applications and data in Big Data is so large that it is impossible to develop a general Hadoop storage architecture. With that said, here are some general rules of thumb for architecting a Hadoop storage solution:

You need enough disks in a datanode to satisfy the application IO demands. This means that the number of disks can vary quite a bit, but you must understand how the application is accessing the data (the IO pattern) and the related IO requirements to properly choose the number of disks. Is the data access more streaming oriented or more IOPS oriented? Are there more reads than writes? How much does IO influence the run time?
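As a back-of-the-envelope illustration of this rule, the sketch below sizes the disk count from whichever demand is larger, streaming throughput or IOPS. The function name, its parameters, and the example figures are assumptions for illustration. Real per-disk numbers depend on your hardware and have to be measured.

```python
import math

def disks_per_datanode(required_mb_per_s, required_iops,
                       disk_mb_per_s, disk_iops):
    """Estimate disks needed so neither throughput nor IOPS is the bottleneck.

    All four inputs are assumed, per-datanode figures supplied by the caller.
    """
    by_throughput = math.ceil(required_mb_per_s / disk_mb_per_s)
    by_iops = math.ceil(required_iops / disk_iops)
    # The more demanding of the two requirements decides the count.
    return max(by_throughput, by_iops, 1)
```

A streaming-heavy application ends up limited by the throughput term, while a small-random-read application ends up limited by the IOPS term. This is why the IO pattern matters before you buy disks.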

A second rule of thumb is to define how much parallelism you think you might need. This information can tell you how many datanodes you need, which also tells you how many datanodes may be accessing the same data file. So if you think you can get lots of parallelism in your application, then you will need a fair number of datanodes, and you may have to increase the number of data copies from three to a larger number.
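A rough way to turn this rule into numbers is sketched below. It assumes a very simple model in which each datanode runs a fixed number of tasks at once and each copy of the data can comfortably serve a fixed number of concurrent readers. Both of these, along with the function names, are illustrative assumptions and not Hadoop settings. The default of three copies comes from the text above.

```python
import math

def datanodes_needed(parallel_tasks, tasks_per_node):
    """Datanodes required to run the desired number of tasks at once."""
    return math.ceil(parallel_tasks / tasks_per_node)

def copies_for_readers(concurrent_readers, readers_per_copy, minimum=3):
    """Data copies needed if each copy serves a limited number of readers.

    Never drops below the minimum, which defaults to three copies.
    """
    return max(minimum, math.ceil(concurrent_readers / readers_per_copy))
```

With modest parallelism the copy count stays at three. It only grows when many tasks need the same data at the same time.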

If you don't have enough datanodes, your network traffic will increase rather alarmingly. This is because Hadoop may need to copy data from one datanode to another datanode where it is needed. Hadoop may also need to do some housekeeping in the background, such as deleting excess copies of the data or updating the copies, which also puts more pressure on the network.

As a general rule of thumb, put enough capacity in each datanode to hold the largest file. Remember, Hadoop doesn't do striping, so the file must reside on the datanode in its entirety. Don't skimp on disks.
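The capacity rule can be checked with a couple of lines of arithmetic, sketched here in Python. The function names and the 25 percent headroom for temporary and intermediate data are assumptions for illustration. The default of three copies matches the figure mentioned earlier.

```python
def node_can_hold(largest_file_gb, node_capacity_gb, headroom=0.25):
    """True if the largest file fits on one datanode with headroom to spare.

    The headroom fraction is an assumed reserve, not a Hadoop parameter.
    """
    usable_gb = node_capacity_gb * (1 - headroom)
    return largest_file_gb <= usable_gb

def raw_capacity_gb(data_gb, copies=3):
    """Total raw disk space across the cluster, counting every copy."""
    return data_gb * copies
```

If node_can_hold returns False for your largest expected file, that is the signal to add disks to each datanode and not only more datanodes.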