This walk-through is a project I've been working on for some time to help reduce the missed-opportunity (no-show) rate at medical centers. It demonstrates the entire machine learning process: extracting datasets from a SQL server and loading them directly into your R environment, performing exploratory analyses and creating visualizations to gain insight, engineering new features, tuning and training the model, and finally measuring the model's performance. At the end of this article, I also demonstrate a web application I developed to present the patient prediction results to the agency's leaders and frontline staff. I'm sharing my results and methodology as a guide for others starting a similar project, or for anyone who wants to improve upon my results.

If you have any questions about my process, or any suggestions, please leave a comment at the end. Thank you for reading!

Patients who do not show up for their appointments without advance warning are a problem that negatively affects the Bronx VAMC's 191,000+ veteran patient population and the facility itself, because each no-show lowers access (increases wait times) and is a missed billing opportunity. This is a substantial problem because the facility's average no-show rate of 17.5%, across roughly 450,000 scheduled appointments each year, has never dropped below the national 10% goal. This project is an attempt to lower the number of patients that no-show by using XGBoost, an implementation of the gradient boosting machine learning algorithm. The project is ongoing, as new features are continuously being tested to increase the model's accuracy.

The first step is to install and load the following R packages which are required to perform each step in this project. Package descriptions were sourced from their respective reference manual which can be found at https://cran.r-project.org/web/packages/.

In [2]:

library(RODBC)      # Used to establish a connection with Microsoft SQL Server
library(plyr)       # Tools for splitting, applying and combining data
library(dplyr)      # Functions to simplify data manipulation
library(data.table) # Functions to simplify data manipulation
library(xts)
library(zoo)        # Required to make a zoo object
library(caret)      # Functions to streamline the model training process for complex classification problems
library(e1071)      # Misc. statistics functions required for the caret package
library(Matrix)     # Creates sparse matrices
library(xgboost)    # An efficient and scalable implementation of the gradient boosting framework
library(Metrics)    # Evaluation metrics for machine learning
library(ROCR)       # Used to visualize the performance of scoring classifiers
library(pROC)       # Used to display and analyze ROC curves
library(sp)         # Classes and methods for spatial data
library(ggmap)      # Create spatial visualizations with ggplot2
library(ggplot2)    # Create elegant data visualizations using the grammar of graphics
library(doParallel) # Provides parallel processing
n.cores <- detectCores()     # Identify the machine's number of cores
registerDoParallel(n.cores)  # Used to train the model across multiple cores in parallel
n.cores <- getDoParWorkers()
paste(n.cores, 'workers utilized')

All database table and column names have been given aliases for security reasons. In this next step, we will gather two years of historical appointment information, as well as patient demographic information, from VHA's Corporate Data Warehouse. We will connect R directly to Microsoft SQL Server via an ODBC connection using the RODBC package and use Structured Query Language (SQL) to pull the information from 11 tables. We will set three variables: start.date, end.date, and station, which allow us to reuse the same dates and facility number across each query.

We will store each table in its own data frame and combine them later, because the dplyr package gives us more control over joins than SQL does. Writing many small queries is also much easier than writing one large, complex, and slow SQL query.
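The pattern for each pull looks roughly like the sketch below. The server, database, table, and column names here are placeholders (the real CDW schema is aliased for security), so the connection calls are shown commented out.

```r
# Reusable variables for every query; the station number is hypothetical
start.date <- "2016-10-01"
end.date   <- "2018-09-30"
station    <- "526"

# Build one small query per table using the shared variables
query <- paste0(
  "SELECT PatientID, AppointmentDateTime, ClinicID, Status ",
  "FROM Appt.Appointment ",
  "WHERE Sta3n = '", station, "' ",
  "AND AppointmentDateTime BETWEEN '", start.date, "' AND '", end.date, "'")

# ch    <- odbcDriverConnect('driver={SQL Server};server=<server>;trusted_connection=true')
# appts <- sqlQuery(ch, query)   # one data frame per table
# odbcClose(ch)
```

Each of the 11 tables gets its own small query and its own data frame, all sharing the same three variables.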

Now that the raw datasets are stored in separate data frames, the next step is to clean and combine them. This step also requires a lot of domain specific knowledge. For example, we will remove all deceased patients because they will no longer be scheduling appointments and will negatively affect the model if left in. Also, we will remove patients flagged as test patients in the database. We will also remove clinics where no-show performance is not monitored. Next, we'll remove inpatient appointments and appointments with data entry errors. Lastly, even though appointments canceled by the facility after the appointment date/time count toward the no-show rate, we'll remove those because they are out of the patient's control.
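The filtering steps above can be sketched with base R's subset(). The flag column names (Deceased, TestPatient, Inpatient) and Status strings below are assumptions, shown against a tiny toy data frame for illustration.

```r
# Toy data frame standing in for the combined appointment data
d <- data.frame(
  PatientID   = 1:5,
  Deceased    = c(0, 1, 0, 0, 0),
  TestPatient = c(0, 0, 1, 0, 0),
  Inpatient   = c(0, 0, 0, 1, 0),
  Status      = c('NO-SHOW', 'CHECKED OUT', 'CHECKED OUT',
                  'CHECKED OUT', 'CANCELLED BY CLINIC'))

d <- subset(d, Deceased == 0)                    # drop deceased patients
d <- subset(d, TestPatient == 0)                 # drop test patients
d <- subset(d, Inpatient == 0)                   # drop inpatient appointments
d <- subset(d, Status != 'CANCELLED BY CLINIC')  # drop facility cancellations
```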

We dropped 177,823 observations that would have otherwise negatively affected our model.

Next, we'll combine the homeless, subabuse, and lowincome data frames with our working data frame d. We'll remove duplicate patient ID records by calling subset(x, !duplicated(PatientID)) on each data frame, because each patient may have multiple entries but we only need one instance per unique patient. Then we assign them to their matching patient IDs in our working data frame.
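A toy example of this dedupe-and-match pattern, using an assumed Flag column, might look like:

```r
# One of the flag data frames (e.g. homeless) with duplicate patient entries
homeless <- data.frame(PatientID = c(7, 7, 9), Flag = c(1, 1, 1))
d        <- data.frame(PatientID = c(7, 8, 9))

homeless <- subset(homeless, !duplicated(PatientID))  # one row per patient
# Assign the flag to matching patient IDs in the working data frame
d$Homeless <- homeless$Flag[match(d$PatientID, homeless$PatientID)]
d$Homeless[is.na(d$Homeless)] <- 0                    # patients not in the list
```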

Let's create our Label feature. This is the feature that we are trying to predict. It may also be called the outcome variable or the dependent variable. For this project, since we are trying to predict patients that will no-show, we'll assign a 1 to every instance that the patient either no-showed or canceled after their appointment date/time. We'll assign 0 to every case where the patient did show up. We want to train our model to understand which features contribute to every case where our Label feature is equal to 1.
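The label assignment reduces to a single vectorized ifelse(). The exact Status strings below are assumptions; adjust them to your data.

```r
# Toy appointment statuses
d <- data.frame(Status = c('CHECKED OUT', 'NO-SHOW',
                           'CANCELLED BY PATIENT AFTER APPT', 'CHECKED OUT'))
# 1 = no-showed or canceled after the appointment date/time, 0 = showed
d$Label <- ifelse(d$Status %in% c('NO-SHOW',
                                  'CANCELLED BY PATIENT AFTER APPT'), 1, 0)
```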

This plot shows us that most patients, 86.02%, show up for their appointment while 13.98% no-show. For this project, we are most interested in predicting which patients are most likely to no-show rather than which patients will likely show.

Feature engineering is the process of creating new features that help the learning algorithm understand the problem. Features in machine learning are analogous to independent variables in statistical experiments.

VHA groups similar services into departments to determine if a patient is new to the department or if the patient is returning for a follow-up visit. We will use VHA's national guidelines to group the services together, which will also reduce the number of factors or categorical variables in this feature from 120 down to 51. Our goal is to find out if patients are not showing up for some departments more than others. Each service is assigned a 3-digit identifier that we'll use when assigning them to groups.
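The grouping can be sketched as a lookup from 3-digit identifiers to department names. The codes and group names below are hypothetical; the real VHA groupings differ.

```r
# Toy stop codes; '502'/'509' and '323' are illustrative, not real groupings
d <- data.frame(StopCode = c('502', '509', '323', '999'))
groups <- list('Mental Health' = c('502', '509'),
               'Primary Care'  = c('323'))
d$Department <- 'Other'
for (g in names(groups)) d$Department[d$StopCode %in% groups[[g]]] <- g
```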

Okay, now that we have our departments grouped, let's see if some departments have more no-shows than others. It's important to view department no-shows by rate rather than by total no-shows because some departments are much larger than others. Looking simply at total no-shows, the Mental Health department would seem to have the most no-shows. However, it has more no-shows because it has a much higher volume of appointments than any other department; total counts alone would not tell us whether some factor causes patients to no-show in Mental Health more than in other departments. Let's find each department's no-show rate and plot them from highest to lowest.

In [9]:

# Aggregate and split the total shows and no-shows for each department
dep <- data.frame(table(d$Department, d$Label))
dep.no.shows <- subset(dep, Var2 == 1)  # Each department's no-show aggregates
dep.shows    <- subset(dep, Var2 == 0)  # Each department's show aggregates
# Because some departments may have no no-shows, we cannot simply join the Freq
# column from dep.shows to dep.no.shows: the column lengths may differ. Use
# match() to assign the show values for only those departments in dep.no.shows
dep.no.shows$Shows <- dep.shows$Freq[match(dep.no.shows$Var1, dep.shows$Var1)]
dep.no.shows$Rate <- dep.no.shows$Freq / (dep.no.shows$Freq + dep.no.shows$Shows)
dep.no.shows$Lab <- paste(round(dep.no.shows$Rate * 100, 1), '%', sep='')
ggplot(data=dep.no.shows, aes(x=reorder(Var1, Rate), y=Rate)) +
  ggtitle("Department No-show Rates") +
  xlab("Departments") +
  geom_bar(stat="identity", fill="#FF9999") +
  geom_text(data=dep.no.shows, aes(x=Var1, y=Rate, label=Lab),
            vjust=0.35, hjust=1.05, size=2.5, color="#000000") +
  coord_flip()                     # Flip the position of the x and y axes
rm(dep, dep.no.shows, dep.shows)   # Remove temp data frames

We can see that the Emergency department has an extremely low no-show rate, which makes sense. It's also the 9th largest producer of appointments out of the 51 departments, which is significant because it means patients scheduled in the Emergency department will almost always show up. We'll see later that the model picks up on this fact as well. We can also use this information to find best practices from departments that combine low no-show rates with high volume, such as Prosthetics.

Our training dataset will consist of 12 months of data, but building our features will also require data from the 12-to-24-month period. We can drop that older period once the features are created.

The first feature will determine if the patient has not visited the same department within a 24-month period, and thus be considered a new patient. We'll do this by first creating a new column that concatenates each patient's unique ID and their scheduled appointment's department only if they had a checked-out visit. Then, we'll loop over the column and assign a Yes if no duplicates are found and No if it does find one or more duplicates.
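A vectorized sketch of that logic follows; it replaces the loop with a membership test. The Status value marking a completed visit is an assumption.

```r
# Toy data: patient 1 has a prior checked-out visit in Primary Care
d <- data.frame(PatientID  = c(1, 1, 2),
                Department = c('Primary Care', 'Primary Care', 'Primary Care'),
                Status     = c('CHECKED OUT', 'SCHEDULED', 'SCHEDULED'))

# Concatenate patient ID and department
d$PatDeptID <- paste0(d$PatientID, '-', d$Department)
# Departments the patient has actually visited (checked-out appointments only)
seen <- d$PatDeptID[d$Status == 'CHECKED OUT']
# 'Yes' = no prior visit to this department, i.e. a new patient
d$NewPatient <- ifelse(d$PatDeptID %in% seen, 'No', 'Yes')
```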

The next feature we'll create is the number of days that are between each patient's past most recent appointment and the current appointment. We want to find out if a long time has elapsed since the patient has last visited the medical center.
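One way to compute this in base R is to sort by patient and date, then take per-patient differences with ave(); the column names here are illustrative.

```r
# Toy data: patient 1 has two visits ten days apart
d <- data.frame(PatientID = c(1, 1, 2),
                ApptDate  = as.Date(c('2020-01-01', '2020-01-11', '2020-03-01')))

d <- d[order(d$PatientID, d$ApptDate), ]
# Days between consecutive appointments per patient (NA for the first visit)
d$DaysSinceLast <- ave(as.numeric(d$ApptDate), d$PatientID,
                       FUN = function(x) c(NA, diff(x)))
```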

Next, let's reduce the number of marital-status factors down to just married or single. The data has several non-married categories such as divorced, single, and widowed. We are most interested in married versus single because we want to find out which patients have a loved one at home who can help them.

In [14]:

# Patient's marital status: code married values as 'M' and all the others as 'S'
d$MaritalStatus <- ifelse(grepl('MARRIED', d$MaritalStatus), 'M', 'S')

For the next feature, we'll assign each patient into one of n age categories. We'll calculate n by using k-means clustering. Then we'll assign the centers of each cluster to each patient to create our factors.
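A minimal sketch of the idea, with k = 3 for illustration (in practice n would be chosen by inspecting the clustering): cluster the ages, then use each patient's rounded cluster center as their factor level.

```r
set.seed(1)  # kmeans starts from random centers, so fix the seed
ages <- c(22, 24, 25, 48, 50, 52, 74, 76, 78)
km <- kmeans(ages, centers = 3)
# Assign each patient the (rounded) center of their cluster as a factor level
AgeGroup <- factor(round(km$centers[km$cluster]))
```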

The next feature, the sum of each patient's appointments within 365 days, will be a feature on its own and be used to calculate other features, such as each patient's cancelation rates and no-show rates. First, we must sum each patient's appointments grouped by their unique ID. We'll use data.table's functions to create a new aggregate feature in our data frame with the sums for each unique patient ID. We'll use this same technique for several other features as well.
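In data.table the idiom is d[, ApptSum := sum(Appt), by = PatientID]; a base-R equivalent with ave(), shown on toy data, makes the mechanics clear:

```r
# Toy data: patient 1 has three appointments, patient 2 has two
d <- data.frame(PatientID = c(1, 1, 1, 2, 2))
d$Appt <- 1  # one row per appointment
# Per-patient appointment totals, broadcast back to every row
d$ApptSum365 <- ave(d$Appt, d$PatientID, FUN = sum)
```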

Next, we'll sum the total number of consults each patient had in 365 days. A consult is a digital request created when a patient sees a specialist for the first time; follow-up visits with the same specialist do not require consults. The dataset contains three kinds of values: NA, -1, and values greater than 0, which are the consult request IDs. We'll transform all values less than 0 or NA to 0, and every other value to 1.
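That recode-then-sum step can be sketched as follows; the request IDs are made up.

```r
# Toy data: NA, -1, and two consult request IDs
d <- data.frame(PatientID = c(1, 1, 1, 2),
                Consult   = c(NA, -1, 40123, 40456))

d$Consult[is.na(d$Consult)] <- 0          # NA -> 0
d$Consult <- ifelse(d$Consult > 0, 1, 0)  # request IDs -> 1, everything else -> 0
# Sum each patient's consults across their appointments
d$ConsultSum <- ave(d$Consult, d$PatientID, FUN = sum)
```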

Next, we'll calculate each patient's no-show rates by department. First, we'll concatenate each patient's ID and the department to make a new DepartmentID. Then, we'll sum the no-shows and total appointments based on the new DepartmentID we created. Next, we'll use that ID to assign our aggregates back to our working data frame. Once we have the total no-shows and total appointments by our new ID, we can then divide them to get the no-show rate.
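The concatenated-ID pattern in base R, on toy data:

```r
# Toy data: patient 1 has two Mental Health appts (one no-show) and one
# Primary Care appt; patient 2 no-showed their only appointment
d <- data.frame(PatientID  = c(1, 1, 1, 2),
                Department = c('MH', 'MH', 'PC', 'MH'),
                Label      = c(1, 0, 0, 1))

d$DepartmentID <- paste0(d$PatientID, '-', d$Department)
d$DeptNoShows  <- ave(d$Label, d$DepartmentID, FUN = sum)          # no-shows per ID
d$DeptAppts    <- ave(rep(1, nrow(d)), d$DepartmentID, FUN = sum)  # appts per ID
d$NoShowRate   <- d$DeptNoShows / d$DeptAppts
```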

Next, we'll use the same DepartmentID and ClinicID to find each patient's cancelation rate by department and clinic. This is the total number of times the patient acted responsibly by canceling their appointment before its scheduled date/time, relative to the total number of appointments they've had.

Our next feature will be the total number of appointments each patient has on the same date as the predicted appointment. We'll use the same technique by creating a new ID based on the patient's ID and the appointment date. Then we'll sum by the new PatientApptDateID.

In [ ]:

# Count the sum of appointments on each day
# Create a unique ID from the patient's ID and appointment date
d$PatientApptDateID <- paste(d$PatientID,
                             format(d$AppointmentDateTime, format='%m%d%Y'), sep='')
# Sum same-day appointments using our created PatientApptDateID
d[, ApptSumPerDay := sum(Appt), by=PatientApptDateID]

Next, we'll fill in missing values in the AppointmentLength feature with 'None'. We'll fill in missing values in the Homeless, SubstanceAbuse, and LowIncome columns with 0's and transform all others into 1's.
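A sketch of both fills, assuming AppointmentLength is a character column and the flag columns hold NA or some non-missing code:

```r
d <- data.frame(AppointmentLength = c('30', NA, '60'),
                Homeless          = c(NA, 'YES', NA),
                stringsAsFactors  = FALSE)

d$AppointmentLength[is.na(d$AppointmentLength)] <- 'None'  # missing -> 'None'
d$Homeless <- ifelse(is.na(d$Homeless), 0, 1)              # missing -> 0, else 1
```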

Next, let's see whether patients are more likely to no-show based on their home address's geolocation. If so, we can create a new categorical feature that groups patients into different geographic areas. We'll use the ggmap package to plot the longitude and latitude coordinates of patients' home addresses on top of a map of New York, coloring shows in blue and no-shows in red to differentiate the two groups.

First, we're going to separate the no-shows from shows and save them in two different data frames. Then, we'll call the ggmap() function and pass in our longitude and latitude vectors from the no-shows' data frame.
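A sketch of that split, with toy coordinates; the ggmap/ggplot2 calls are shown commented because they require the ggmap package and a map-tile API key.

```r
# Toy coordinates standing in for geocoded home addresses
d <- data.frame(Longitude = c(-73.90, -73.80, -74.00),
                Latitude  = c(40.85, 40.82, 40.70),
                Label     = c(1, 0, 1))

no.shows <- subset(d, Label == 1)
shows    <- subset(d, Label == 0)

# ny <- get_map('New York, NY', zoom = 10)
# ggmap(ny) +
#   geom_point(data = shows,    aes(Longitude, Latitude), color = 'blue', alpha = .2) +
#   geom_point(data = no.shows, aes(Longitude, Latitude), color = 'red',  alpha = .2)
```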