In this R tutorial, we will analyze and visualize medical care expenses. We will use the lm() function to fit a linear regression model to the data in R.

Health insurance companies must make money to stay afloat. To do so, an insurer must collect more in yearly premiums than it spends on medical care for its beneficiaries. Insurers therefore invest considerable money and time in developing models that accurately forecast medical expenses for the insured population.

The goal of this tutorial is to analyze patient data to estimate the average medical care expenses for various population segments. These estimates can be used to create actuarial tables that set the price of patients' yearly premiums higher or lower according to the expected treatment costs.

Install and Load Packages

Below are the packages and libraries we will install and load for this tutorial (of these, only psych, which provides the pairs.panels() function, is used directly below):


> install.packages("C50")
> install.packages("gmodels")
> install.packages("party")
> install.packages("RColorBrewer")
> install.packages("psych")
> library(C50)
> library(gmodels)
> library(party)
> library(RColorBrewer)
> library(psych)

Download and Load the Medical Care Expense Dataset

Since we will be using the medical care expenses dataset, you will need to download it. The dataset is already packaged and available for easy download from the dataset page, or directly here: Medical Care Expenses – insurance.csv

> insurance <- read.csv("insurance.csv", stringsAsFactors = FALSE)

View the Medical Care Expense Dataset

head() function

Now, let’s take a look at the data by using the head() command:

> head(insurance)

  age    sex  bmi children smoker    region expenses
1  19 female 27.9        0    yes southwest 16884.92
2  18   male 33.8        1     no southeast  1725.55
3  28   male 33.0        3     no southeast  4449.46
4  33   male 22.7        0     no northwest 21984.47
5  32   male 28.9        0     no northwest  3866.86
6  31 female 25.7        0     no southeast  3756.62

As one can see above, the data recorded for each person comprise age, sex, BMI, number of children, smoker status, region, and expenses.
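Before modeling, it is worth confirming how R parsed the file. A minimal sketch with str() and summary() (a few synthetic rows, taken from the head() output above, stand in for insurance.csv so the snippet runs on its own):

```r
# A few rows standing in for insurance.csv; with the real data,
# call str() and summary() on the insurance data frame directly.
demo <- data.frame(
  age      = c(19, 18, 28),
  sex      = c("female", "male", "male"),
  bmi      = c(27.9, 33.8, 33.0),
  children = c(0, 1, 3),
  smoker   = c("yes", "no", "no"),
  region   = c("southwest", "southeast", "southeast"),
  expenses = c(16884.92, 1725.55, 4449.46)
)

str(demo)               # column types: numeric vs. character
summary(demo$expenses)  # min/quartiles/max of the outcome variable
```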

lm() function

The function we will use to model the insurance data is lm(). A sample lm() call is provided below:

> ins_model <- lm(expenses ~ age + sex + smoker, data = insurance)

The above fits a linear regression model to the data in R. The R formula syntax uses the tilde character to describe the model: the dependent variable, expenses, goes to the left of the tilde, while the independent variables go to the right, separated by + signs. Using a . adds all remaining variables as independent variables, as shown below:

> ins_model <- lm(expenses ~ ., data = insurance)

Histogram of expenses

A histogram of the expenses shows that they range from $1,122 to $63,770, but that most fall between $0 and $15,000.
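A sketch of how that histogram can be produced with base R's hist() (a right-skewed synthetic sample stands in for the real expenses column so the snippet runs on its own):

```r
# Synthetic right-skewed expenses; with the real data, pass
# insurance$expenses to hist() instead.
set.seed(42)
expenses <- rgamma(1000, shape = 2, scale = 6500)

# Most values cluster at the low end, with a long right tail,
# mirroring the skew described above
hist(expenses, breaks = 30,
     main = "Distribution of Annual Medical Expenses",
     xlab = "Expenses ($)")
```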

Correlation Matrix for age, BMI, children, and expenses

> cor(insurance[c("age", "bmi", "children", "expenses")])

               age        bmi   children   expenses
age      1.0000000 0.10934101 0.04246900 0.29900819
bmi      0.1093410 1.00000000 0.01264471 0.19857626
children 0.0424690 0.01264471 1.00000000 0.06799823
expenses 0.2990082 0.19857626 0.06799823 1.00000000

Scatterplot Matrix Using pairs()

> pairs(insurance[c("age", "bmi", "children", "expenses")])

Scatterplot Matrix Using pairs.panels()

The scatterplot matrix below uses pairs.panels(), which gives a better visual of the data. Above the diagonal, the scatterplots are replaced with a correlation matrix; on the diagonal, a histogram depicts the distribution of values for each feature; and the scatterplots below the diagonal carry additional visual information.

Correlation Ellipse

The oval-shaped object on each scatterplot is called a correlation ellipse, which provides a visual indication of correlation strength. The dot at the center of the ellipse marks the point at the mean values of the x- and y-axis variables. The correlation between the two variables is indicated by the shape of the ellipse: the more stretched it is, the stronger the correlation. A nearly perfect round oval, such as for bmi and children, indicates a very weak correlation.

> pairs.panels(insurance[c("age", "bmi", "children", "expenses")])

lm() function

The lm() function will return a regression model object that can be used to make predictions. Interactions between independent variables can be specified using the * operator.
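To illustrate how the returned model object makes predictions with predict(), here is a sketch; the simulated rows below (with a deliberately strong smoker effect) stand in for insurance.csv, and the column names follow the tutorial's:

```r
# Simulate a small insurance-like dataset
set.seed(1)
n <- 200
demo <- data.frame(
  age    = sample(18:64, n, replace = TRUE),
  sex    = sample(c("male", "female"), n, replace = TRUE),
  smoker = sample(c("yes", "no"), n, replace = TRUE)
)
# Illustrative cost structure: $250 per year of age, $20,000 for smoking
demo$expenses <- 250 * demo$age +
  20000 * (demo$smoker == "yes") + rnorm(n, sd = 2000)

model <- lm(expenses ~ age + sex + smoker, data = demo)

# Predict annual expenses for a new 30-year-old male smoker
new_patient <- data.frame(age = 30, sex = "male", smoker = "yes")
predict(model, newdata = new_patient)
```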


> ins_model <- lm(expenses ~ age + sex + smoker + region, data = insurance)
> ins_model <- lm(expenses ~ ., data = insurance)
> ins_model


Call:
lm(formula = expenses ~ ., data = insurance)

Coefficients:
    (Intercept)              age          sexmale              bmi         children
       -11941.6            256.8           -131.4            339.3            475.7
      smokeryes  regionnorthwest  regionsoutheast  regionsouthwest
        23847.5           -352.8          -1035.6           -959.3

Evaluating Model Performance

The breakdown below, produced by calling summary() on the lm() model, shows the residuals and coefficients. The residuals section provides summary statistics for the prediction errors, some of which are quite substantial. Fifty percent of the errors fall between the 1Q and 3Q values, meaning predictions for the middle half of the data were between $2,850.90 over the true value and $1,383.90 under it.

For each estimated regression coefficient, the p-value provides an estimate of the probability that the true coefficient is zero, given the value of the estimate. The model we created has several highly significant variables, and they seem to be related to the outcome in logical ways.

The coefficient of determination, also known as the multiple R-squared value, measures how well the model as a whole explains the values of the dependent variable: the closer the value is to 1.0, the better the model explains the data. In the output below, the adjusted R-squared is 0.7494, which tells us that the model explains roughly 75 percent of the variation in the dependent variable.

> summary(ins_model)


Call:
lm(formula = expenses ~ ., data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max
-11302.7  -2850.9   -979.6   1383.9  29981.7

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -11941.6      987.8 -12.089  < 2e-16 ***
age                256.8       11.9  21.586  < 2e-16 ***
sexmale           -131.3      332.9  -0.395 0.693255
bmi                339.3       28.6  11.864  < 2e-16 ***
children           475.7      137.8   3.452 0.000574 ***
smokeryes        23847.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -352.8      476.3  -0.741 0.458976
regionsoutheast  -1035.6      478.7  -2.163 0.030685 *
regionsouthwest   -959.3      477.9  -2.007 0.044921 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared:  0.7509,    Adjusted R-squared:  0.7494
F-statistic: 500.9 on 8 and 1329 DF,  p-value: < 2.2e-16

The next question is: how can we improve model performance?

The key difference between regression modeling and other machine learning approaches is that regression typically leaves feature selection and model specification to the user.

Adding non-linear relationships

Below, we add age2 alongside age in the lm() formula, using the form expenses ~ age + age2. This allows the model to separate the linear and non-linear impact of age on medical expenses.
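A minimal sketch of the squared-term approach (the toy numbers below are illustrative, not the real dataset):

```r
# Convex toy data: yearly costs accelerate with age
demo <- data.frame(age      = c(20, 30, 40, 50, 60),
                   expenses = c(2000, 3500, 6000, 10000, 16000))

# Create the squared term, then include both terms in the formula
demo$age2 <- demo$age^2
model <- lm(expenses ~ age + age2, data = demo)
coef(model)  # a positive age2 coefficient captures the upward curvature
```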

Converting a numeric variable to a binary indicator

The ifelse() function will be used to convert a numeric variable to a binary indicator:

> insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)

Putting it all together

Putting together the model improvements above, the regression formula now reflects:

A non-linear term for age (age2)

A binary indicator for a BMI of 30 or more (bmi30)

An interaction between obesity and smoking (expenses ~ bmi30*smoker)

As one can see from the resulting output, the adjusted R-squared has improved greatly, from the previous 0.75 to 0.87; the model now explains 87 percent of the variation in medical treatment costs. The higher-order age2 term is statistically significant, as is the obesity indicator, bmi30. The interaction between obesity and smoking suggests a massive effect: in addition to the increased costs of over $13,304 for smoking alone, obese smokers spend another $19,810 per year. This suggests that smoking exacerbates diseases associated with obesity.
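Putting the three changes together, the improved model can be fit along these lines. This is a sketch: the simulated rows below stand in for insurance.csv, and the coefficients they produce are illustrative, not the tutorial's.

```r
# Simulated stand-in for insurance.csv; with the real data loaded,
# start at the age2/bmi30 lines.
set.seed(7)
n <- 300
insurance <- data.frame(
  age      = sample(18:64, n, replace = TRUE),
  sex      = sample(c("male", "female"), n, replace = TRUE),
  bmi      = runif(n, 18, 45),
  children = sample(0:4, n, replace = TRUE),
  smoker   = sample(c("yes", "no"), n, replace = TRUE),
  region   = sample(c("northeast", "northwest", "southeast", "southwest"),
                    n, replace = TRUE)
)
# Illustrative cost structure: quadratic age effect, smoking surcharge,
# and an extra penalty for obese smokers
insurance$expenses <- 3 * insurance$age^2 +
  13000 * (insurance$smoker == "yes") +
  20000 * (insurance$smoker == "yes") * (insurance$bmi >= 30) +
  rnorm(n, mean = 5000, sd = 2000)

# The three improvements: squared age, obesity indicator, interaction
insurance$age2  <- insurance$age^2
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex +
                   bmi30 * smoker + region, data = insurance)
summary(ins_model2)$adj.r.squared
```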