All submissions of the EM system will be redirected to Online Manuscript Submission System. Authors are requested to submit articles directly to Online Manuscript Submission System of respective journal.

Heart Disease Analysis System Using Data
Mining Techniques

G.Karthiga1, C.Preethi1, R.Delshi Howsalya Devi2

Department of Computer Science and Engineering , KLN College of Engineering, Sivagangai, India

Abstract

Data mining is an iterative progress in which evolution is defined by detection, through usual or manual methods. Knowledge discovery and data mining have found various applications in scientific domain Heart disease is a term for defining a huge amount of healthcare conditions that are related to the heart. This medicinal condition defines the unpredicted health conditions that directly control all the parts of the heart. Different data mining techniques such as association rule mining, classification, clustering are used to predict the heart disease in health care industry .The heart disease database is preprocessed to make the mining process more efficient. The preprocessed data is clustered using clustering algorithms like K-means to cluster relevant data in database. Maximal Frequent Itemset Algorithm (MAFIA) is used for mining maximal frequent patterns in heart disease database. The frequent patterns can be classified using C4.5 algorithm as training algorithm using the concept of information entropy. The results showed that the designed prediction system is capable of predicting the heart attack with good accuracy.

Keywords

INTRODUCTION

Data mining is process of extracting useful information from large amount of databases.Data mining is most useful in
an exploratory analysis because of nontrivial information in large volumes of data. Data mining is the process of
extracting data for finding buried patterns which can be transformed into significant. Data mining knowledge afford a
user-oriented approach to new and concealed patterns in the data.

The knowledge which is exposed can be used by the healthcare practitioners to get better quality of service and to
reduce the extent of adverse medicine effect. Hospitals have to reduce the charge of medical tests. They can attain these
consequences by employing suitable decision support systems. Health care data is enormous. It consists of patient
centric data, resource organization data and altered data. Medical care organizations must have capability to explore
data. Treatment records of millions of patients can be hoarded and data mining techniques will aid in answering
numerous essential and decisive questions interrelated to health care.Data mining techniques has been performed in
healthcare domain. This realization is in the arouse of explosion of difficult medical data. Medicinal data mining can
utilize the veiled patterns present in huge medical data which otherwise is left undiscovered. Data mining techniques
which are useful to medical data include association rule mining for finding frequent patterns, prediction, classification
and clustering. Data mining techniques are more useful in predicting heart diseases, lung cancer, and diabetes.

This paper analyzes the heart disease predictions using classification algorithms. These hidden patterns can be used for
health diagnosis in medicinal data. Data mining technology afford an effective approach to latest and indefinite patterns in the data. The information which is identified can be used by the healthcare administrators to get better services.
Heart disease was the most important reason of victims in the countries like India, United States. Data mining
techniques like Association Rule Mining, Clustering, Classification algorithms such as Decision tree [7], C4.5
algorithm, Neural Network [8], Naive Bayes [9] are used to explore the different kinds of heart based problems [1].
Data mining techniques like C4.5 algorithm and K-means clustering are used for validating the accuracy of medicinal
data. These algorithms can be used to optimize the data storage for practical and legal purposes.

RELATED WORK

The data mining techniques includes different works to explore a variety of diseases such as Cancer, Diabetes, Heart
diseases. Heart disease is the most important reason of fatality in the UK, USA, Canada, and England [2]. Heart disease
kills individual in each 32 seconds in the world. Jyoti Soni et al proposed three different supervised machine learning
algorithms for heart disease prediction. They are Naïve Bayes, K-nearest neighbor, and Decision tree. These algorithms
have been used for analyzing the heart disease. Tanagra is the data mining tool used for classifying these medical data
and these data are calculated using 10 fold cross validation. Naive Bayes algorithm performs well when compared to
other algorithms [3].

Genetic algorithm have been used in [6], to reduce the definite data size to obtain the best possible subset of attribute
which is essential for heart disease prediction. Classification is supervised learning method to extract models relating
main classes of data. Decision Tree, Naïve Bayes and Classification via clustering are the three classifiers used to
analyze the occurrence of heart disease for the patients. Shekar et al proposed new algorithm to mine association rules
from medical data based on digit sequence and clustering for heart attack prediction the entire data base is divided into
partitions of equal size, each partition will be called cluster. This approach reduces main memory requirement since it
consider only a small cluster at a time and it is scalable and efficient [5].

MATERIALS AND METHODS

The extraction of significant patterns from the heart disease data warehouse is presented in this section. The heart
disease data warehouse contains the screening clinical data of heart patients. Initially, the data warehouse is
preprocessed to make the mining process more efficient. Four steps for the Disease Prediction Process are as follows:

1. Preprocessing is done in order to handle missing values.

2. Cluster relevant data using K-means Clustering.

3. Selecting the frequent pattern using MAFIA

4. Applying C4.5 algorithm to classify the pattern.

Data Preprocessing

Cleaning and filtering of the data might be necessarily carried out with respect to the data and data mining algorithm
employed so as to avoid the creation of deceptive or inappropriate rules or patterns. In preprocessing first it selects an
attribute for selecting a subset of attributes with good predicting capability. It handles all missing values and
investigates each possibility. If an attribute has more than 5% missing values then the records should not be deleted and
it is advisable to impute values where data is missing, using a suitable method.

K-means Clustering

Grouping a set of objects in such a way that objects in the same group is more similar to each other than to those in
other groups. Clustering is an unsupervised learning. The algorithm clusters information’s into k groups, where k is
considered as an input parameter. It then assigns each information’s to clusters based upon the observation’s proximity
to the mean of the cluster. The cluster’s mean is then more computed and the process begins again.The k-means
algorithm is one of the simplest clustering techniques and it is commonly used in medical imaging and related fields.
The steps involved in K-means algorithms are as follows:

- Choose the number of clusters, k.

- Randomly create k clusters and find the cluster midpoint.

- Consign each point to the closest cluster midpoint

- Recompute

the new cluster midpoints.

- Iterate the above two steps until various convergence condition is met.

The preprocessed data is clustered using k-means to cluster relevant data in the heart disease database shown in Table
I.K-means algorithm produces a specific number of separate, non-hierarchical clusters. It is a simplest algorithm and performs faster than other clustering algorithms. It allows run on huge databases. It is mainly suitable for globular
clusters.

MAFIA

The association rule mining problem is a main predicament in the data mining field with various realistic applications
such as user medical data analysis, intrusion detection.
MAFIA is used for mining maximal frequent item sets from a transactional database [4]. This algorithm is mainly
efficient when the item sets in the database are very long. The search strategy of this algorithm integrates a depth-first
traversal of the item set lattice with efficient pruning mechanisms. A ⊆ I an item set, and call A a k-item set if the
cardinality of item set A is k. Let database X be a multiset of subsets of I, and let support (A) be the fraction of item
sets B in X such that A ⊆ B.. If support(A)=minSup, then A is a frequent item set, and indicate the set of all Frequent
Item sets(MFI) by FI. If A is recurrent and no superset of A is frequent, then A is a Maximally Frequent Item set, and
denotes the set of all Maximally Frequent Item sets by MFI.

MAFIA efficiently stores the transactional database as a series of vertical bitmaps, where each bitmap represents an
item set in the database and a bit in each bitmap represents whether or a given customer has the corresponding item
set. Initially, each bitmap represents an item set in database. The item sets that are checked for frequency in the
database become recursively longer and the vertical bitmap representation works perfectly in conjunction with this
item set extension

C4.5 Algorithm

Classification is an unsupervised learning used to predict the class of objects whose class label is unknown. It is used
for creating classification rules by means of decision trees from a given data set.
Decision tree is used as a prognostic model. C4.5, C5.0, CART, ID3 are methods for building decision trees. It is an
extension of the basic ID3 algorithm. By using C4.5, decision trees can be building from a set of training data with the
information entropy. It is a statistical classifier. It outputs can be in the form of if then rules

EXPERIMENTAL RESULTS

The result of experimental analysis in identifying important patterns for predicting heart diseases are presented in
Figure 1. The heart disease database is preprocessed effectively by removing related records and given that missing
values .The well mannered heart disease data set[10], resulting from preprocessing, is then composed by K-means
algorithm with the K value of 2.Then the frequent forms are mined efficiently from the set appropriate to heart disease,
using the MAFIA .

The experimental results of this approach as presented in Figure1.Confusion matrix is a matrix representation of the
classification results. From the confusion matrix to explore the performance criterion for the classifiers in disease
prediction accuracy have been calculated for the medical datasets.

These can be transformed into True-Positive (TP), True-Negative (TN), False- Negative (FN) and False-Positive (FP)
metrics.

(5.1) TNFNFP+TPTN TP =Accurac

True Positive (TP): Total fraction of Data Pre members classified as Class A belongs to Class A

False Positive (FP): Total fraction of members of Class A but does not belong to Class A.

False Negative (FN): Total fraction of members of Class A incorrectly classified as not belonging to Class A

True Negative (TN): Total fraction of members which does not belong to Class A are classified not a part of
Class A .It can also be given as(100%-FP).

CONCLUSIONS AND FUTURE WORK

Medical related information’s are huge in nature and it can be derived from different birthplaces which are not entirely
applicable in feature. In this work, heart disease prediction system was developed using clustering and classification
algorithms to predict the effective risk level and accuracy of the patients. In future work, we have planned to propose
an effective disease prediction system to predict the heart disease with better accuracy using different data mining
techniques and compare the performance of algorithm with other related data mining algorithms.

ACKNOWLEDGEMENT

The authors would like to thank the reviewers for their valuable comments that would help to improve this