The Pima are a group of Native Americans living in Arizona. A genetic predisposition allowed this group to survive for years on a diet poor in carbohydrates. In recent years, a sudden shift from traditional agricultural crops to processed foods, together with a decline in physical activity, led them to develop the highest prevalence of type 2 diabetes, and for this reason they have been the subject of many studies.

This is a classic supervised binary classification problem. Given a number of records, each described by a set of characteristics (features), we want to build a machine learning model that identifies people affected by type 2 diabetes.

To solve the problem we will analyse the data, apply any required transformation and normalisation, train a model with a machine learning algorithm, check the performance of the trained model, and iterate with other algorithms until we find the best performing one for our dataset.

import os
import pandas as pd

# We read the data from the CSV file
data_path = os.path.join(DATASET_PATH, 'pima-indians-diabetes.csv')
dataset = pd.read_csv(data_path, header=None)

# Because the CSV doesn't contain any header, we add column names
# using the description from the original dataset website
dataset.columns = ["NumTimesPrg", "PlGlcConc", "BloodP", "SkinThick",
                   "TwoHourSerIns", "BMI", "DiPedFunc", "Age", "HasDiabetes"]

The correlation matrix is an important tool to understand the relationships between the different characteristics. The values range from -1 to 1, and the closer a value is to 1 (or to -1), the stronger the positive (or negative) correlation between the two characteristics. Let's calculate the correlation matrix for our dataset.

In [87]:

corr = dataset.corr()
corr

Out[87]:

|               | NumTimesPrg | PlGlcConc | BloodP   | SkinThick | TwoHourSerIns | BMI      | DiPedFunc | Age       | HasDiabetes |
|---------------|-------------|-----------|----------|-----------|---------------|----------|-----------|-----------|-------------|
| NumTimesPrg   | 1.000000    | 0.129459  | 0.141282 | -0.081672 | -0.073535     | 0.017683 | -0.033523 | 0.544341  | 0.221898    |
| PlGlcConc     | 0.129459    | 1.000000  | 0.152590 | 0.057328  | 0.331357      | 0.221071 | 0.137337  | 0.263514  | 0.466581    |
| BloodP        | 0.141282    | 0.152590  | 1.000000 | 0.207371  | 0.088933      | 0.281805 | 0.041265  | 0.239528  | 0.065068    |
| SkinThick     | -0.081672   | 0.057328  | 0.207371 | 1.000000  | 0.436783      | 0.392573 | 0.183928  | -0.113970 | 0.074752    |
| TwoHourSerIns | -0.073535   | 0.331357  | 0.088933 | 0.436783  | 1.000000      | 0.197859 | 0.185071  | -0.042163 | 0.130548    |
| BMI           | 0.017683    | 0.221071  | 0.281805 | 0.392573  | 0.197859      | 1.000000 | 0.140647  | 0.036242  | 0.292695    |
| DiPedFunc     | -0.033523   | 0.137337  | 0.041265 | 0.183928  | 0.185071      | 0.140647 | 1.000000  | 0.033561  | 0.173844    |
| Age           | 0.544341    | 0.263514  | 0.239528 | -0.113970 | -0.042163     | 0.036242 | 0.033561  | 1.000000  | 0.238356    |
| HasDiabetes   | 0.221898    | 0.466581  | 0.065068 | 0.074752  | 0.130548      | 0.292695 | 0.173844  | 0.238356  | 1.000000    |

I'm not a doctor and I have no medical training, but from the data I can guess that the higher a patient's age or BMI, the more likely it is that the patient will develop type 2 diabetes.
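We can quickly check this intuition by sorting the correlations of every feature against the target column (a small check added here for illustration, reusing the corr matrix computed above):

# Sort the features by their correlation with the target:
# after HasDiabetes itself, PlGlcConc, BMI and Age come out on top
corr['HasDiabetes'].sort_values(ascending=False)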

Visualising the data is an important step of the analysis. With a graphical visualisation we get a better understanding of how the values of each feature are distributed: for example, we can see the average age of the patients or the average BMI.
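As a minimal sketch (assuming matplotlib is available), pandas can draw a histogram for every column in a single call:

import matplotlib.pyplot as plt

# Draw one histogram per feature to inspect its distribution
dataset.hist(bins=50, figsize=(12, 10))
plt.show()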

We could of course limit our inspection to the tabular view, but we might miss important things that affect the precision of our model.

An important thing I noticed in the dataset (and that wasn't obvious at the beginning) is that some people have null (zero) values for some of the features: a BMI or a blood pressure of 0 is simply not possible.
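A quick way to spot these suspicious zeros, added here for illustration, is to count them column by column:

# Count how many zero values each column contains; a zero in
# NumTimesPrg is legitimate, a zero in BMI or BloodP is not
(dataset == 0).sum()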

How can we deal with these values? We will see later, during the data transformation phase.

We have noticed from the previous analysis that some patients have missing data for some of the features. Machine learning algorithms don't work very well when data is missing, so we have to find a way to "clean" the data we have.

The easiest option would be to eliminate all patients with null/zero values, but that way we would throw away a lot of important data.

Another option is to calculate the median value of a column and substitute it wherever that column contains a zero or null. Let's apply this second method.

In [90]:

# Calculate the median value for BMI
median_bmi = dataset['BMI'].median()
# Substitute it in the BMI column of the
# dataset where values are 0
dataset['BMI'] = dataset['BMI'].replace(to_replace=0, value=median_bmi)

In [91]:

# Calculate the median value for BloodP
median_bloodp = dataset['BloodP'].median()
# Substitute it in the BloodP column of the
# dataset where values are 0
dataset['BloodP'] = dataset['BloodP'].replace(to_replace=0, value=median_bloodp)

In [92]:

# Calculate the median value for PlGlcConc
median_plglcconc = dataset['PlGlcConc'].median()
# Substitute it in the PlGlcConc column of the
# dataset where values are 0
dataset['PlGlcConc'] = dataset['PlGlcConc'].replace(to_replace=0, value=median_plglcconc)

In [93]:

# Calculate the median value for SkinThick
median_skinthick = dataset['SkinThick'].median()
# Substitute it in the SkinThick column of the
# dataset where values are 0
dataset['SkinThick'] = dataset['SkinThick'].replace(to_replace=0, value=median_skinthick)

In [94]:

# Calculate the median value for TwoHourSerIns
median_twohourserins = dataset['TwoHourSerIns'].median()
# Substitute it in the TwoHourSerIns column of the
# dataset where values are 0
dataset['TwoHourSerIns'] = dataset['TwoHourSerIns'].replace(to_replace=0, value=median_twohourserins)

I haven't transformed all the columns, because for some features a value of zero makes sense (like "Number of times pregnant").
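For reference, the five cells above can be collapsed into a single loop over the affected columns; this is just a more compact equivalent of the same substitution:

# Replace zeros with the column median in every column where
# a zero value is physically impossible
for col in ["BMI", "BloodP", "PlGlcConc", "SkinThick", "TwoHourSerIns"]:
    dataset[col] = dataset[col].replace(to_replace=0, value=dataset[col].median())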

Now that we have transformed the data we need to split the dataset into two parts: a training dataset and a test dataset. Splitting the dataset is a very important step for supervised machine learning models. Basically, we use the first part to train the model (ignoring the column with the pre-assigned label), then we use the trained model to make predictions on new data (the test dataset, which is not part of the training set) and compare the predicted values with the pre-assigned labels.
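The train_set and test_set variables used below can be obtained in several ways; here is a minimal sketch using sklearn's train_test_split with an 80/20 split (the split ratio and the stratification are my assumptions):

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as a test set; stratifying on the label
# keeps the same proportion of diabetic patients in both sets
train_set, test_set = train_test_split(
    dataset, test_size=0.2, random_state=42, stratify=dataset["HasDiabetes"])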

# Separate labels from the rest of the dataset
train_set_labels = train_set["HasDiabetes"].copy()
train_set = train_set.drop("HasDiabetes", axis=1)
test_set_labels = test_set["HasDiabetes"].copy()
test_set = test_set.drop("HasDiabetes", axis=1)

One of the most important data transformations we need to apply is feature scaling. Most machine learning algorithms don't work well when the features have very different ranges of values. In our case, for example, the Age ranges from 20 to 80 years old, while the number of times a patient has been pregnant ranges from 0 to 17. For this reason we need to apply a proper transformation.
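Here is a minimal sketch producing the train_set_scaled and test_set_scaled arrays used below, assuming a min-max scaling to the [0, 1] range (a StandardScaler would be a valid alternative):

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the training set only, then apply the same
# transformation to both sets to avoid leaking test-set statistics
scaler = MinMaxScaler()
train_set_scaled = scaler.fit_transform(train_set)
test_set_scaled = scaler.transform(test_set)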

To compare multiple algorithms on the same dataset, there is a very convenient module in sklearn called model_selection. We create a list of algorithms and score each of them with the same cross-validation method; at the end we pick the one with the best score.

In [99]:

# Import all the algorithms we want to test
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Prepare an array with all the algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVC', SVC()))
models.append(('LSVC', LinearSVC()))
models.append(('RFC', RandomForestClassifier()))
models.append(('DTC', DecisionTreeClassifier()))

In [102]:

# Prepare the configuration to run the tests
seed = 7
results = []
names = []
X = train_set_scaled
Y = train_set_labels

In [103]:

# Every algorithm is tested and results are
# collected and printed
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

The default parameters of an algorithm are rarely the best ones for a given dataset. With sklearn we can easily build a parameter grid and try all the possible combinations. At the end we inspect the best_estimator_ property to get the estimator with the best parameters for our dataset.
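As a sketch, a grid search over two SVC parameters could look like this (the actual grid values are my assumption; any combination of kernels and ranges can be explored):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Try every combination of the listed parameters with 10-fold
# cross validation and keep the best performing estimator
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': ['scale', 0.01, 0.1, 1],
}
grid_search = GridSearchCV(SVC(), param_grid, cv=10, scoring='accuracy')
grid_search.fit(train_set_scaled, train_set_labels)
print(grid_search.best_score_)
print(grid_search.best_params_)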

import numpy as np

# Create an instance of the algorithm using parameters
# from the best_estimator_ property
svc = grid_search.best_estimator_

# Use the whole dataset to train the model
X = np.append(train_set_scaled, test_set_scaled, axis=0)
Y = np.append(train_set_labels, test_set_labels, axis=0)

# Train the model
svc.fit(X, Y)

# We create a new (fake) person having the three most correlated values high
new_df = pd.DataFrame([[6, 168, 72, 35, 0, 43.6, 0.627, 65]],
                      columns=train_set.columns)
# We scale those values like the others
new_df_scaled = scaler.transform(new_df)

In [115]:

# We predict the outcome
prediction = svc.predict(new_df_scaled)

In [117]:

# A value of "1" means that this person is likely to have type 2 diabetes
prediction

We finally reach a score of 76% using the SVC algorithm with parameter optimisation. Note that there may still be room for further analysis and optimisation, for example trying different data transformations or algorithms that haven't been tested yet. Once again, I want to repeat that training a machine learning model to solve a problem with a specific dataset is a try / fail / improve process.

First of all I need to thank my wife, Dr Daniela Ceccarelli Ceccarelli, for helping me validate this experiment and for checking that I didn't write anything wrong from a medical point of view. I also want to thank Dr. Jason Brownlee for his fantastic blog, which has helped me understand many of the concepts used here. I strongly advise you to have a look at his blog: https://machinelearningmastery.com