Monthly Archives: March 2017

Whenever I’m faced with a machine learning task, my goal on day 1 is to build an initial model. The model will undoubtedly need tuning in the days or weeks that follow, but it’s good to have a starting point. In the project below, I timeboxed about 4 hours to build an initial machine learning model and see how far I could get with some initial results.

A peer of mine in my Master’s program mentioned that there is publicly available Medicare CMS data. I have very little knowledge of healthcare data, but thought I’d explore it and see if there was an aspect that could be useful in building a model to make predictions.

The data:

2008 outpatient claims data (used this; only 1 of the 20 available samples, but still about 1.1 million rows of claims data)

2008 beneficiary data (used this)

2008 inpatient claims data (did not use this due to the initial time constraint)

2008 prescription data (did not use this due to the initial time constraint)

I identified one useful piece of information to build a model on: predict a Medicare claim’s ICD9 code among the codes relating to diseases of the circulatory system (these make up about 11% of claims).
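As a sketch of how the circulatory-system grouping can be identified, the helper below flags ICD9 codes whose three-digit category falls in the circulatory chapter (390–459). The function name and the leading-digits heuristic are my own illustration, not the exact filter used in the notebook.

```python
def is_circulatory(icd9_code):
    """Return True if an ICD9 code falls in the circulatory chapter (390-459)."""
    try:
        # Numeric ICD9 codes use the first three digits as the category.
        category = int(str(icd9_code)[:3])
    except ValueError:
        return False  # V-codes, E-codes, malformed or missing values
    return 390 <= category <= 459

print(is_circulatory('412'))    # old myocardial infarction: circulatory
print(is_circulatory('25000'))  # diabetes: not circulatory
```

Applying a predicate like this to the claims’ diagnosis column is one way to carve out the roughly 11% of claims in this grouping.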

Identify Features from Beneficiary data (just grabbed them all to start)


features = ['AGE', 'BENE_RACE_CD', 'BENE_COUNTY_CD', 'BENE_ESRD_IND',
            'BENE_HI_CVRAGE_TOT_MONS', 'BENE_SMI_CVRAGE_TOT_MONS',
            'BENE_HMO_CVRAGE_TOT_MONS', 'PLAN_CVRG_MOS_NUM',
            'SP_ALZHDMTA', 'SP_CHF', 'SP_CHRNKIDN', 'SP_CNCR', 'SP_COPD',
            'SP_DEPRESSN', 'SP_DIABETES', 'SP_ISCHMCHT', 'SP_OSTEOPRS',
            'SP_RA_OA', 'SP_STRKETIA']

# The name of the column for the output variable.
target = 'ICD9_DGNS_CD_1'
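The later cells use a `df_joined_cleaned` frame combining beneficiary features with claims, though the join itself isn’t shown. Below is a minimal sketch of that step, assuming the beneficiary ID column `DESYNPUF_ID` (the key used in the CMS synthetic PUF files) and using tiny stand-in frames:

```python
import pandas as pd

# Stand-in frames; the real data has ~1.1M claim rows.
df_beneficiary = pd.DataFrame({
    'DESYNPUF_ID': ['A', 'B'],
    'AGE': [71, 84],
    'SP_CHF': [1, 2],
})
df_claims = pd.DataFrame({
    'DESYNPUF_ID': ['A', 'A', 'B'],
    'ICD9_DGNS_CD_1': ['4280', '412', '25000'],
})

# One row per claim, with that beneficiary's features attached.
df_joined = df_claims.merge(df_beneficiary, on='DESYNPUF_ID', how='inner')
print(df_joined.shape)
```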

Group Target ICD9 codes from Claims data (chose Circulatory System Diseases – which is 1 of 17 ICD9 groupings)

from sklearn.cross_validation import train_test_split
# (in newer scikit-learn this lives in sklearn.model_selection)

x = df_joined_cleaned[features]
y = df_joined_cleaned[target]

# Divide the data into a training and a test set.
random_state = 0  # Fixed so that everybody gets the same split
test_set_fraction = 0.2
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=test_set_fraction, random_state=random_state)

print('Size of training set: {}'.format(len(x_train)))
print('Size of test set: {}'.format(len(x_test)))

pca = decomposition.PCA(n_components=9)
print('original shape prior to PCA', x_train.shape)
x_train_new = pca.fit_transform(x_train)
x_test_new = pca.transform(x_test)
print('new shape after to PCA', x_train_new.shape)

original shape prior to PCA (1715, 19)
new shape after to PCA (1715, 9)
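One quick sanity check on a choice like `n_components=9` is how much of the original variance the retained components keep, via `explained_variance_ratio_`. A minimal sketch, with random data standing in for the real training matrix:

```python
import numpy as np
from sklearn import decomposition

# Random stand-in for x_train, matching the (1715, 19) shape above.
rng = np.random.RandomState(0)
x_train = rng.rand(1715, 19)

pca = decomposition.PCA(n_components=9)
pca.fit(x_train)

# Fraction of total variance captured by the 9 retained components.
retained = pca.explained_variance_ratio_.sum()
print('variance retained: {:.1%}'.format(retained))
```

Note that PCA is scale-sensitive, so standardizing features (e.g. with `StandardScaler`) before fitting would likely change which components dominate here.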

Of the 40 ICD9 codes representing circulatory diseases, my model only produced predictions for 0422, 0430, and 412, which isn’t ideal, but those three codes make up 37% of my training data. Above, I plotted recall, precision, and f1 scores. I like using the f1 score as it’s really a balance of recall and precision (what portion of the true positives is your model capturing, and how good is it at predicting true positives). At this point, much more investigation of the data and tweaking of the models is needed to improve performance. Gaining domain knowledge in this field would certainly help too!

The data is unbalanced, and if I had just guessed code 412 for all instances, my recall would have increased, but my precision and f1 would have dropped.
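That majority-class baseline is easy to verify on a small imbalanced example (the label counts below are illustrative, not the real claim distribution): always predicting the most common code pushes weighted recall up while precision and f1 lag behind.

```python
from collections import Counter
from sklearn.metrics import precision_recall_fscore_support

# Imbalanced toy test set: '412' dominates.
y_test = ['412'] * 6 + ['0422'] * 3 + ['0430'] * 1

# Always guess the most common code.
majority = Counter(y_test).most_common(1)[0][0]
y_pred = [majority] * len(y_test)

p, r, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average='weighted', zero_division=0)
print('precision={:.2f} recall={:.2f} f1={:.2f}'.format(p, r, f1))
```

Here weighted recall equals the majority class’s share of the data, while weighted precision and f1 come out lower, matching the trade-off described above.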

This was a “quick and dirty” model-building exercise, which didn’t produce great results, but it is a good starting point. Rarely will you get great results with a limited amount of work.

Overall, there is some opportunity here, but it would take many more iterations of model tuning. I would recommend bringing in the drug prescription data source, along with a couple more years of claims data, so health trends by patient could be leveraged.