Bottom Line:
The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement.

ABSTRACT:
Large quantities of data have been generated from multiple sources at exponential rates in the last few years. These data are generated at high velocity, as real-time and streaming data, in a variety of formats. These characteristics give rise to challenges in their modeling, computation, and processing. Hadoop MapReduce (MR) is a well-known data-intensive distributed processing framework that uses the distributed file system (DFS) for Big Data. Current implementations of MR support the execution of only a single algorithm in the entire Hadoop cluster. In this paper, we propose MapReducePack (MRPack), a variation of MR that supports the execution of a set of related algorithms in a single MR job. We exploit the computational capability of a cluster by increasing the compute-intensiveness of MapReduce while maintaining its data-intensive approach. MRPack uses the available computing resources by dynamically managing the task assignment and intermediate data. Intermediate data from multiple algorithms are managed using multi-key and skew mitigation strategies. The performance study of the proposed system shows that it is time, I/O, and memory efficient compared to the default MapReduce. The proposed approach reduces the execution time by 200% with an approximate 50% decrease in I/O cost. Complexity and qualitative results analysis shows significant performance improvement.
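
The idea of running several algorithms in one job with multi-key intermediate data can be sketched in a few lines of plain Python. This is an in-memory illustration only, not the MRPack implementation: the registry `ALGORITHMS`, the function `run_packed_job`, and the two toy algorithms are hypothetical names chosen for this example, and tagging each intermediate key with an algorithm identifier is an assumption about what "multi-key" could look like.

```python
from collections import defaultdict


def word_count_map(record):
    """Toy algorithm 1: emit (word, 1) for every word in the record."""
    for word in record.split():
        yield word, 1


def char_count_map(record):
    """Toy algorithm 2: emit the record length under one fixed key."""
    yield "total_chars", len(record)


# Hypothetical registry: algorithm id -> (map function, reduce function).
ALGORITHMS = {
    "wordcount": (word_count_map, sum),
    "charcount": (char_count_map, sum),
}


def run_packed_job(records, algorithms=ALGORITHMS):
    """Run every registered algorithm in a single pass over the input.

    Intermediate pairs are grouped under a composite key
    (algorithm id, original key), so outputs of different algorithms
    never collide even when they emit the same original key.
    """
    groups = defaultdict(list)
    records_read = 0
    for record in records:  # the input is scanned exactly once
        records_read += 1
        for algo_id, (map_fn, _) in algorithms.items():
            for key, value in map_fn(record):
                groups[(algo_id, key)].append(value)

    # Reduce phase: route each group to the reducer of its own algorithm.
    results = {algo_id: {} for algo_id in algorithms}
    for (algo_id, key), values in groups.items():
        reduce_fn = algorithms[algo_id][1]
        results[algo_id][key] = reduce_fn(values)
    return results, records_read
```

Calling `run_packed_job(["a b a", "b c"])` produces one result dictionary per algorithm while reading each input record only once, which is the property the abstract attributes to a packed job.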

pone.0136259.g006: Analysis based on the number of algorithms in an MRPack job with respect to execution time in terms of I/O and network communication.

Mentions:
We test the performance gain and loss as the number of algorithms in MRPack varies. In this case, we specifically monitor the I/O communication with HDFS during job execution. In MapReduce, each job must retrieve its data from HDFS and write the results back, which leads to long-running jobs with heavy I/O operations, as shown in Fig 6. In MRPack, by contrast, the algorithms are executed as a single job and the I/O operations are performed only once, so performance improves significantly compared with MapReduce. In MapReduce, increasing the number of algorithms also increases the number of jobs that must be executed separately. In MRPack, increasing the number of algorithms only changes the set of algorithms within a single job. Executing a single job yields a significant performance improvement, as shown in Fig 6. However, this method has some memory-based limitations, as discussed in the coming sections.
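
The I/O argument above can be made concrete with a toy cost model. The sketch below is an assumption-laden simplification, not a measurement from the paper: it counts only the bytes of input read from HDFS, ignores output writes, shuffle traffic, and the memory limits mentioned above, and the function name `input_bytes_read` is hypothetical.

```python
def input_bytes_read(num_algorithms, input_bytes, packed):
    """Toy model of input volume read from HDFS.

    Separate MapReduce jobs: every algorithm is its own job and
    re-reads the whole input. Packed job: the input is read once,
    regardless of how many algorithms are packed together.
    """
    if num_algorithms < 1:
        raise ValueError("num_algorithms must be at least 1")
    jobs = 1 if packed else num_algorithms
    return jobs * input_bytes
```

Under this model, four algorithms over the same input read four times the input volume as separate jobs but only one times the volume as a packed job; the gap grows linearly with the number of algorithms.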
