Source:

URL:

Abstract:

Data management becomes a complex task when hundreds of petabytes of data are gathered, stored, and processed on a day-to-day basis. Efficient processing of this exponentially growing data is essential in this context. This paper discusses the processing of large volumes of data with the Support Vector Machine (SVM) algorithm, a machine learning algorithm applicable to classification and regression analysis, using techniques that range from a single-node linear implementation to parallel processing on distributed processing frameworks such as Hadoop. The Map-Reduce component of Hadoop performs the parallelization that feeds information to the SVM. The paper also presents a detailed anatomy of the SVM algorithm and sets out a roadmap for implementing it in both linear and Map-Reduce fashion. The main objective is to explain in detail the steps involved in developing an SVM algorithm from scratch using standard linear and Map-Reduce techniques, and to conduct a performance analysis across a linear implementation of SVM, an SVM implementation on single-node Hadoop, an SVM implementation on a Hadoop cluster, and a proven tool such as R. These are gauged with respect to the accuracy achieved, their processing speed across varying data sizes, their capability to handle large data volumes without failing, etc.
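To make the contrast between the linear and Map-Reduce styles concrete, the following is a minimal sketch, not the paper's implementation. It assumes a linear-kernel SVM trained by batch subgradient descent on the regularized hinge loss, which is one common way to make SVM training decomposable; the abstract does not state which solver the paper uses. The names `partial_subgradient`, `combine`, and `train_svm`, the hyperparameters, and the toy data are all hypothetical. The map step computes a partial subgradient per data partition and the reduce step sums the partial results; here both run in a single Python process, whereas on Hadoop each partition would be handled by a separate mapper.

```python
from functools import reduce


def partial_subgradient(w, b, partition):
    """Map step: hinge-loss subgradient summed over one data partition."""
    gw = [0.0] * len(w)
    gb = 0.0
    for x, y in partition:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin < 1.0:  # only margin violators contribute
            for j, xj in enumerate(x):
                gw[j] -= y * xj
            gb -= y
    return gw, gb, len(partition)


def combine(a, c):
    """Reduce step: add two partial results element-wise."""
    return [p + q for p, q in zip(a[0], c[0])], a[1] + c[1], a[2] + c[2]


def train_svm(partitions, dim, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent; lam, lr and epochs are illustrative values."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        # On a cluster, each partition would be processed by its own mapper.
        mapped = [partial_subgradient(w, b, p) for p in partitions]
        gw, gb, n = reduce(combine, mapped)
        w = [wi - lr * (lam * wi + g / n) for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data: (features, label in {+1, -1}).
data = [
    ([2.0, 1.5], 1), ([1.5, 2.5], 1), ([3.0, 2.0], 1), ([2.5, 3.0], 1),
    ([-2.0, -1.5], -1), ([-1.5, -2.5], -1), ([-3.0, -2.0], -1), ([-2.5, -3.0], -1),
]
partitions = [data[0::2], data[1::2]]  # two "input splits"
w, b = train_svm(partitions, dim=2)
accuracy = sum(predict(w, b, x) == y for x, y in data) / len(data)
```

Passing a single partition, `train_svm([data], dim=2)`, corresponds to the linear (single-node) case, since the reduce step then has only one partial result to return. The design relies on the subgradient being a sum over training points, so the partial sums can be computed independently and combined in any order.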