Abstract

Medical data mining has recently become one of the most popular topics in the data mining community. This is due to the societal importance of the field and to the particular computational challenges it poses. However, current medical data mining approaches often assign identical costs to the different types of classification errors, or ignore those costs altogether. Thus, their outcome may be unexpected. This paper applies a new meta-heuristic approach, called the Homogeneity-Based Algorithm (HBA), to optimize classification accuracy when analyzing some medical datasets. The HBA first expresses the objective as an optimization problem in terms of the error rates and the associated penalty costs. These costs may be dramatically different in medical applications, because the implications of a false-positive case and a false-negative case may be tremendously different. When the HBA is combined with traditional classification algorithms, it enhances their prediction accuracy. It does so by using the concept of homogeneous sets. Five medical datasets, obtained from the machine learning data repository at the University of California, Irvine (UCI), USA, were tested. The computational results indicate that the HBA, when combined with traditional methods, can significantly outperform current stand-alone data mining approaches.

Introduction

Increasingly powerful mechanisms for storing data have made many medical datasets available in recent decades. The value of extracting useful knowledge from such datasets, and thus of discovering decision-making insights for the diagnosis and treatment of diseases, is also increasingly recognized. In the typical setting, a dataset of historic data describing some type of disease or medical disorder is assumed to be available. Such a dataset consists of patient records describing the physical and laboratory examinations related to that disease or disorder. The computational challenge is then how to develop a diagnostic system that can assist in diagnosing this type of ailment based on the knowledge extracted from the historic dataset. Human analysts need special computational tools to process and comprehend such large and complex datasets.
Medical data mining can assist in addressing such challenges. Data mining analysts can extract decision regions from a given historic dataset related to a medical condition or disease. Usually, such decision regions consist of medical indicators, which could be used to diagnose the condition or disease. In medical diagnosis (as in most other domains), usually there are three different cases of possible errors:
• The false-negative case in which a patient, who in reality has the disease, is diagnosed as disease free.
• The false-positive case in which a patient, who in reality does not have the disease, is diagnosed as having the disease.
• The unclassifiable case in which the prediction system cannot diagnose a given case. This happens due to insufficient knowledge extracted from the historic data.
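The three cases above can be tallied from a list of predictions. The following is a minimal sketch, not taken from the paper: the function name error_rates and the convention that None marks an unclassifiable prediction are assumptions made only for illustration.

```python
from collections import Counter


def error_rates(actual, predicted):
    """Return (false_positive, false_negative, unclassifiable) rates.

    actual:    list of bools, True meaning the patient has the disease.
    predicted: list of True / False / None, where None is an assumed
               marker for "the system could not diagnose this case".
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")

    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if guess is None:
            counts["uc"] += 1   # unclassifiable case
        elif guess and not truth:
            counts["fp"] += 1   # healthy patient diagnosed as diseased
        elif truth and not guess:
            counts["fn"] += 1   # diseased patient diagnosed as disease free

    n = len(actual)
    return counts["fp"] / n, counts["fn"] / n, counts["uc"] / n


# Hypothetical toy data: five patients, two of whom could not be diagnosed.
actual = [True, True, False, False, True]
predicted = [True, False, True, None, None]
fp_rate, fn_rate, uc_rate = error_rates(actual, predicted)
```

Correct diagnoses fall through all three branches and are not counted as errors.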
Under the above considerations, current medical data mining approaches oftentimes assign identical penalty costs for the false-positive and the false-negative cases or just ignore the penalty cost for the unclassifiable cases. Such approaches will be discussed in Section 2. Thus, their outcome may be unexpected or even unacceptable.
The penalty costs for the false-positive and the false-negative cases could be dramatically different in a medical application. For instance, consider a life-threatening condition where time is of the essence. If a patient who has the condition receives a false-negative diagnosis, then his/her medical condition goes untreated or is treated inadequately. Precious time may thus be wasted, and the situation may eventually prove fatal to the patient. On the other hand, for the same condition, a false-positive diagnosis may just add some financial costs and anxiety for the patient without resulting in a life-threatening situation.
A penalty cost for unclassifiable cases in medical data mining is needed as well. A diagnosis of a patient as an unclassifiable case may require additional medical examinations and involve some costs. However, that particular case may not necessarily result in a wrong diagnosis.
For the above reasons, this paper applies a new meta-heuristic approach, called the Homogeneity-Based Algorithm (HBA), developed by Pham and Triantaphyllou (2007, 2008), to some well-known medical datasets. The HBA first defines the total misclassification cost of models extracted from classification algorithms as an optimization problem in terms of the false-positive, the false-negative, and the unclassifiable rates along with their penalty costs. The HBA then organizes the extracted models as mutually exclusive decision regions represented by homogeneous sets. These decision regions are refined based on their density by employing a genetic algorithm (GA) approach. This is done in order to minimize the total misclassification cost. The HBA is motivated by the large discrepancy among the three penalty costs described above.
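To illustrate why unequal penalty costs matter, the sketch below evaluates a total misclassification cost for two candidate models. It assumes the cost is a weighted sum of the three rates. The text above says only that the cost is expressed in terms of the rates and their penalty costs, so this form, the function names, and all numbers are hypothetical.

```python
def total_misclassification_cost(rate_fp, rate_fn, rate_uc,
                                 cost_fp, cost_fn, cost_uc):
    """Assumed weighted-sum cost: each error rate times its penalty cost."""
    return cost_fp * rate_fp + cost_fn * rate_fn + cost_uc * rate_uc


def cheapest_model(models, costs):
    """Return the name of the model with the lowest total cost.

    models: dict mapping a model name to (rate_fp, rate_fn, rate_uc).
    costs:  tuple (cost_fp, cost_fn, cost_uc).
    """
    return min(
        models,
        key=lambda name: total_misclassification_cost(*models[name], *costs),
    )


# Hypothetical error rates for two classifiers (not results from the paper).
models = {
    "A": (0.10, 0.02, 0.05),  # more false positives, few false negatives
    "B": (0.03, 0.08, 0.01),  # fewer errors overall, more false negatives
}

equal_costs = (1, 1, 1)     # every error type penalized the same
unequal_costs = (1, 20, 3)  # a false negative assumed 20 times as costly

best_equal = cheapest_model(models, equal_costs)
best_unequal = cheapest_model(models, unequal_costs)
```

Under equal costs model B has the lower total (0.12 versus 0.17), but once false negatives are penalized heavily model A becomes preferable (0.65 versus 1.66). This is the kind of reversal that identical or ignored penalty costs would hide.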
The next section provides a literature review of some related developments. The third section gives a brief description of the HBA as adopted from Pham and Triantaphyllou (2007, 2008). That section shows how the HBA can yield an optimal or near-optimal total misclassification cost. The fourth section discusses some computational results from the medical domain. These results give an indication of how this methodology may improve the prediction accuracy in computerized medical diagnosis. The paper ends with some conclusions and an appendix, which describes the key algorithmic aspects of the HBA.

Conclusion

Medical datasets may possess large amounts of useful information about patients and their medical conditions which may still be unknown to the medical community. By using medical data mining approaches, relationships among key attributes of the data and decision regions within these datasets could unveil new and important medical knowledge. However, current medical data mining approaches often use identical costs, or just ignore the costs, for the three different types of classification errors. Thus, the performance of such data mining approaches may be coincidental.
This paper applied a meta-heuristic approach, called the Homogeneity-Based Algorithm (HBA). The HBA first defined the main objective as an optimization problem in terms of the false-positive, false-negative, and unclassifiable rates along with their associated penalty costs. When the HBA is combined with traditional classification algorithms (such as SVMs, DTs, and ANNs), it may significantly enhance their prediction accuracy by using the concept of homogeneous sets. The HBA was tested on the following well-known medical datasets: the Pima Indian Diabetes dataset, the Haberman Surgery Survival dataset, the Breast Cancer dataset, the Liver Disorders dataset, and the Appendicitis dataset. Each dataset was analyzed under some representative different penalty costs. The derived results clearly show that the total misclassification costs (TCs) obtained under the HBA approach are less than the TCs achieved by the traditional stand-alone approaches. This appears to have important implications for the computerized diagnosis and treatment of these diseases.
Regarding the penalty costs for classification errors, a theoretical model proposed by Thomas and Hofer (1999) can be used to find their optimal values. Furthermore, analyses of the HBA show that medical datasets with higher numbers of attributes (i.e., greater than 9 or 10) cannot be tested because of the HBA's high complexity. An appropriate way to decrease this complexity might be to use certain distance-based approaches for determining homogeneous sets, as described in Turner (1989). Current work by the authors of this paper focuses on developing such alternative approaches, which could also be used in conjunction with traditional data mining methods.