Wednesday, January 7, 2015

Titanic: A case study for predictive analysis on R (Part 1)

Kaggle.com is a popular community of data scientists that hosts data science competitions. This article performs an in-depth predictive analysis on a benchmark case study, Titanic, picked from Kaggle.com.

The case study is a classification problem, where the objective is to determine which class an instance of data belongs to. It can also be called a prediction problem, because we are predicting the class of a record based on its attributes.

Note: This tutorial requires some basic R programming background. If you haven't yet acquainted yourself with R, maybe this is the right time. Codecademy's tutorial is my personal recommendation.

We will be using RStudio here, a widely used IDE for the R language. It is free and open-source; you can download it here.

Dataset:

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean on its maiden voyage. 1502 of the 2224 people on board lost their lives in this disaster. The death toll was so high largely because there were not enough lifeboats. When data was gathered about the passengers who survived or died, it was observed that some groups, such as women, children and upper-class passengers, were more likely to survive than others. Our objective is to predict, from a passenger's attributes, whether that passenger survived, with as much accuracy as possible.

We have a set of two data files: train, which contains records of 891 passengers with the label Survived = 0 (did not survive) or 1 (survived); and test, which has 418 records without the survival information. We want to predict this label and post our predictions to Kaggle.com to check how accurate they are.

Please go on and download the two files (train.csv and test.csv) from here. You can alternatively download them from this link as well. The first file, train.csv, contains records in which the value of the target variable Survived is given; we will use it to build a classification model for prediction.

The other file, test.csv, contains data about different passengers, but does not say whether each passenger survived. We will apply our learnt model to this data to see which passengers are predicted to live or perish.

That's all about the case study; let's start the real thing.

Reading data

First, set the working directory to wherever you have placed the downloaded files.

> setwd("D:/Datasets/Titanic")

The read.csv() function reads a CSV file in table format and saves it in a data frame.

> dataset <- read.csv("train.csv")

Check what variables are available.

> colnames(dataset)
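If you want to try these commands before downloading anything, here is a minimal sketch that reads a tiny made-up CSV from inline text instead of from train.csv. The three rows and the name toy are invented for illustration; only the column names are borrowed from the real file.

```r
# Toy stand-in for train.csv: three made-up rows using a few of the real column names.
csv_text <- "PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,"

# read.csv() accepts inline text through the 'text' argument.
toy <- read.csv(text = csv_text)

colnames(toy)  # names of the variables
dim(toy)       # number of rows and columns
str(toy)       # type of each column; the empty Age in row 3 is read as NA
```

With the real file, the same three inspection calls can be run on dataset.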

PassengerId is a unique identifier of passengers; Survived is the target variable; Pclass represents the class in which the passenger traveled; Name, Sex and Age are the demographics; SibSp is the count of siblings and spouses on board; Parch is the count of parents and children on board; Ticket is the ticket number; Fare is the amount paid as fare; Cabin is the cabin number(s) reserved for the passenger; Embarked tells at which port the passenger boarded.

Let's now have a sneak peek at what the data looks like using the head() function:

We can clearly notice a few things here. First, some data is available in ready-to-use form, like Sex; some fields, like Age and Cabin, have missing values; and variables like Name can be fine-tuned to make them meaningful, or used to derive new information.

All of this is the pre-processing task, in which we clean up the data, fill in missing values, select and deselect fields, and derive new variables before running analysis algorithms.

The base package of R contains a very handy function, summary(), that reports some basic statistics about the data.
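As a rough sketch of that first look, the snippet below runs summary() and counts missing values on a small made-up data frame (the values and the name toy are invented). One caveat: since R 4.0.0, read.csv() keeps text columns as character by default, so summary() only shows per-level counts for a column like Sex if it is a factor, for example by passing stringsAsFactors = TRUE.

```r
# Toy data with made-up values; Age has an NA and Cabin has empty strings.
toy <- data.frame(
  Sex   = c("male", "female", "female", "male"),
  Age   = c(22, 38, NA, 35),
  Cabin = c("", "C85", "", "C123"),
  stringsAsFactors = TRUE   # factors let summary() count the levels of Sex
)

summary(toy)

# Missing values show up as NA in numeric columns ...
na_count <- colSums(is.na(toy))
na_count

# ... and as empty strings in text columns such as Cabin.
empty_cabins <- sum(toy$Cabin == "")
empty_cabins
```

On the real data, the same checks can be applied to dataset to see which fields need filling in during pre-processing.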

There are 314 females and 577 males in the training data; the median age of passengers is 28 years, while the age is not available for 177 passengers; the maximum fare paid is 512.33 pounds. You can get an overall picture of the data just by using this command. Have a keen look.

Submitting on Kaggle:

In order to evaluate our model, we will have to submit our results on Kaggle.com by predicting the value of the Survived variable for each passenger in the test.csv file.

We know that during disasters, women and children are given preference over men and adults. We first see what our training data tells us about this.

Run the table() function on the variables Survived and Sex to create a distribution table over these two variables:
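A minimal sketch of such a table, using made-up toy data rather than the real Titanic counts, might look like this. prop.table() is one way to turn the counts into shares of all passengers; the names toy, counts, shares and rule_share are only for illustration.

```r
# Toy data with made-up values, not the real Titanic counts.
toy <- data.frame(
  Survived = c(1, 1, 0, 0, 0, 1, 0, 1),
  Sex      = c("female", "female", "female", "male", "male", "male", "male", "female")
)

# Cross-tabulate survival (rows) against sex (columns).
counts <- table(Survived = toy$Survived, Sex = toy$Sex)
counts

# Share of all passengers that falls in each cell (the cells sum to 1).
shares <- prop.table(counts)
shares

# Share of passengers for whom "females survive, males do not" holds.
rule_share <- shares["1", "female"] + shares["0", "male"]
rule_share
```

On the real data, the equivalent call would be table(dataset$Survived, dataset$Sex).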

As observed, the proportion of passengers who were female and survived is 0.26, while the proportion who were male and died is 0.525. Together these cover roughly 79% of the training records (0.26 + 0.525), which means we can say with some confidence that females survive and males do not.

Let's test this hypothesis on the test data. Read test.csv into another data frame (called testset below) and introduce a new column, Survived, with 0 as the default value.

Next, we set Survived = 1 for all females. The second command below says: "set variable Survived of testset to 1 wherever the value of variable Sex is female".

> testset$Survived <- 0
> testset$Survived[testset$Sex == 'female'] <- 1

Now, we will export the PassengerId and Survived variables into a new file.

> submit <- data.frame(PassengerId=testset$PassengerId, Survived=testset$Survived)
> write.csv(submit, file="all_females_survive.csv", row.names=FALSE)

Let's submit this new file to Kaggle.com's leaderboard in the Titanic competition.

Go to the "Make a submission" page in the Titanic Disaster challenge and accept the terms and conditions. The next page will ask whether you want to take part as a team or as an individual. You can choose individual (you can later add people to make a team).

On the submission page, browse for the newly created file "all_females_survive.csv" and submit.
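Before uploading, it can help to sanity-check the file. The following is a sketch of the whole gender rule on a made-up four-row stand-in for test.csv; the passenger IDs, the temporary output file and the name check are assumptions for illustration only.

```r
# Toy stand-in for test.csv (made-up rows).
testset <- data.frame(
  PassengerId = c(1001, 1002, 1003, 1004),
  Sex         = c("male", "female", "male", "female")
)

# Gender rule: nobody survives by default, then flip the females to 1.
testset$Survived <- 0
testset$Survived[testset$Sex == "female"] <- 1

submit <- data.frame(PassengerId = testset$PassengerId,
                     Survived    = testset$Survived)

# Write to a temporary file so the sketch does not touch the working directory.
out_file <- tempfile(fileext = ".csv")
write.csv(submit, file = out_file, row.names = FALSE)

# Read the file back and check its shape before uploading.
check <- read.csv(out_file)
stopifnot(identical(colnames(check), c("PassengerId", "Survived")))
stopifnot(all(check$Survived %in% c(0, 1)))
check
```

The same read-back check can be pointed at the real submission file once it has been written.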

The results take some time to compute on Kaggle... there you go!

This means we have achieved 76.55% accuracy, using only the gender (Sex) attribute.

In our next attempts, we will try to improve our accuracy by using the remaining data.

Have fun till then...