Abstract

Data mining is the process of extracting knowledge by discovering new patterns in large datasets. Classification is a data mining task that generalizes a known structure so that it can be applied to new data. Medical investigation, including disease prediction and disease categorization, is a dominant area of current research. In this paper, our focus is to design an efficient classifier that is trained to classify oncogenic data. We use the Lymphography dataset to train the classifier with machine learning techniques, combining feature selection and classification algorithms. Feature selection is a supervised method that attempts to select a subset of the predictor features based on information gain. The Lymphography dataset comprises 18 predictor attributes and 148 instances, with a class label that takes four distinct values. This paper compares the performance of sixteen classification algorithms on the Lymphography dataset for accurate multi-class categorization of medical data. Furthermore, we examine the performance of four feature selection algorithms and their impact on classification accuracy. Our results show that the Random Tree algorithm and Quinlan's C4.5 algorithm give 100 percent classification accuracy, both with all the predictor features and with the feature subset selected by the Fisher Filtering feature selection algorithm. Moreover, the ReliefF feature selection algorithm gives improved results for the Radial Basis Function algorithm, improving the classifier accuracy by 1.35%. We also find that the C4.5 algorithm offers more efficient classification, since the decision tree it generates is smaller than the one generated by Random Tree.
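
To make the notion of selecting features by information gain concrete, the following is a minimal illustrative sketch and not the implementation used in this work. It assumes discrete (categorical) feature values, and the function names (`entropy`, `information_gain`, `rank_features`) and the toy data are hypothetical, chosen only for illustration; they are not drawn from the Lymphography dataset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(columns, labels):
    """Return feature names sorted by descending information gain."""
    return sorted(columns,
                  key=lambda name: information_gain(columns[name], labels),
                  reverse=True)


# Hypothetical toy data: two features, four instances, two classes.
toy_labels = ["a", "a", "b", "b"]
toy_columns = {
    "informative": [1, 1, 2, 2],     # separates the classes perfectly
    "uninformative": [1, 2, 1, 2],   # carries no class information
}
ranking = rank_features(toy_columns, toy_labels)
```

In this sketch a feature subset would be formed by keeping the top-ranked features from `ranking`; how many to keep is a design choice that the sketch leaves open.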