Friday, February 20, 2015

Titanic: A case study for predictive analysis on R (Part 4)

Working with the Titanic data set from Kaggle.com's competition, we predicted passenger survival with 79.426% accuracy in our previous attempt. This time, we will try to learn the missing values instead of substituting the mean or median. Let's start with Age. Looking at the available data, we can hypothesize that Age correlates with attributes like Title, Sex, Fare and HasCabin. Also note that we previously created the variable AgePredicted; we will use it here to identify which records were filled earlier.

> age_train <- dataset[dataset$AgePredicted == 0, c("Age","Title","Sex","Fare","HasCabin")]
> age_test <- dataset[dataset$AgePredicted == 1, c("Title","Sex","Fare","HasCabin")]
> formula <- Age ~ Title + Sex + Fare + HasCabin
> rp_fit <- rpart(formula, data=age_train, method="class")
> PredAge <- predict(rp_fit, newdata=age_test, type="vector")
> table(PredAge)

PredAge
  2  23  25
  8 154 101

This means the values 2, 23 and 25 were predicted for the Age variable for 8, 154 and 101 records respectively. Furthermore, instead of assigning fixed ranges for AgeGroups by judgement, we will use k-means clustering to derive the age groups. The commands below create 7 clusters over the Age variable; the second line assigns each record in dataset a numeric cluster ID.

> k <- kmeans(dataset$Age, 7)

> dataset$AgeGroup <- k$cluster

Let's have a peek at the centers of these clusters as well as their distribution:

> k$centers
        [,1]
1 48.708661
2 16.820144
3 62.152542
4 22.559172
5 37.449495
6 27.429379
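One thing to keep in mind: kmeans cluster IDs are arbitrary labels, not ordered ranges, so cluster 1 is not necessarily the youngest group. The sketch below, on a small made-up vector of ages (not the Titanic data), shows one way to rank the centers into ordered age groups; the names ages and ordered_group are illustrative, not from the original post.

> # Minimal sketch on made-up ages (not the Titanic data)
> set.seed(42)                         # kmeans picks random initial centers
> ages <- c(2, 5, 16, 18, 22, 24, 30, 35, 40, 48, 60, 65)
> k <- kmeans(ages, centers = 3)
> # Cluster IDs (1..3) are arbitrary; rank the centers so that
> # group 1 = youngest cluster, group 3 = oldest cluster
> ordered_group <- rank(k$centers)[k$cluster]

This gives each record a group number that increases with age, which is handy if a model later treats AgeGroup as ordinal.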

Hurrah! We climbed to 310th position with 80.383% accuracy. (Note that ranks improve over time as the competition slows down.) We'll end our data pre-processing here. Next, we will try some more classification models, like random forests and support vector machines, and see if we can do any better than this.
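As a rough preview of that next step, a random forest fit could look like the sketch below. This is only an assumption of what the follow-up might use: it relies on the randomForest package, and the data frames train and test are placeholders for a split of dataset carrying the engineered features above, not code from this post.

> library(randomForest)  # assumed to be installed
> set.seed(415)
> # train/test are placeholders for the split dataset with engineered features
> rf_fit <- randomForest(as.factor(Survived) ~ Title + Sex + Fare + HasCabin + AgeGroup,
+                        data = train, importance = TRUE, ntree = 500)
> Prediction <- predict(rf_fit, newdata = test)

Wrapping Survived in as.factor makes randomForest do classification rather than regression, which is what a survival prediction needs.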