Mathematician in Data Science

Sunday, June 25, 2017

Neural networks are considered complicated, and they are usually explained in terms of neurons and brain function. But we do not need to learn how the brain works to understand the structure of neural networks and how they operate.

Let us start with logistic regression. Recall that a logistic regression divides 2 sets by a line (or a hyperplane if we have higher dimensions).

The logistic regression yields values from 0 to 1, and we can consider the process as making an evaluation. In the process we get data and we calculate our evaluation by a formula.

For example we may have the following assignment: to compute if we have enough goods in storage to last for a week of sales. This is quite a common problem, and say some clerks report their numbers to their manager to figure it out. The manager collects information, processes it and makes an evaluation.

Note that this is how a logistic regression functions.
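The manager's evaluation can be sketched in a few lines of R. This is only an illustration: the reports, weights and bias below are made-up numbers, and `sigmoid` is a helper defined here, not something from a package.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical reports from three clerks (arbitrary units).
reports <- c(1.2, 0.4, 2.0)

# Hypothetical weights and bias the "manager" uses to combine the reports.
weights <- c(0.8, -0.5, 0.3)
bias <- -1.0

# The evaluation: a weighted sum of the reports passed through the sigmoid.
evaluation <- sigmoid(sum(weights * reports) + bias)
evaluation
```

Whatever the reports are, the evaluation stays strictly between 0 and 1, so it can be read as "how sure the manager is".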

Usually, computing whether an amount of goods is sufficient is not the only problem. In addition we need to know, for example, if our storage is filled to optimal capacity (75%-85% or something like this). Therefore we need to evaluate another statistic.

And of course these people should report to their supervisor who will make another evaluation:

So we get a whole hierarchy of evaluations, and at the end they report to the CEO. We can compare it with a neural network structure:

We can observe a lot in common with a corporate chain of command. As we see, middle managers are the hidden layers, which do the bulk of the job. We have a similar flow and processing of information, which is analogous to forward propagation and backward propagation.

What is left now is to explain that computing a sigmoid function at each node is too costly, so it is mostly reserved for the CEO level.
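Here is a minimal sketch of the forward flow of information through such a hierarchy. The layer sizes and random weights are arbitrary, and using ReLU as the cheap activation for the "middle managers" is my assumption; the sigmoid is kept only for the final "CEO" node.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
relu <- function(z) pmax(z, 0)   # assumed cheap activation for hidden nodes

set.seed(1)
x  <- c(0.5, -1.2, 3.0)                # input data: 3 numbers from the "clerks"
W1 <- matrix(rnorm(4 * 3), nrow = 4)   # hidden layer: 4 "managers", 3 inputs each
b1 <- rnorm(4)
W2 <- matrix(rnorm(1 * 4), nrow = 1)   # output layer: the "CEO" weighs 4 managers
b2 <- rnorm(1)

# Forward propagation: each level evaluates what the level below reported.
hidden <- relu(W1 %*% x + b1)          # 4 x 1 matrix of manager evaluations
output <- sigmoid(W2 %*% hidden + b2)  # single value between 0 and 1
as.numeric(output)
```

Training (backward propagation) would adjust `W1`, `b1`, `W2` and `b2`; that part is not shown here.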

As we see, all data columns (variables) except for the last 2 have numeric and
integer types.
Our target variable is
marked as "left" and it is the 7th. In addition, there are no missing values, which is convenient.
According to this data set, the rate of leaving the company is 23.8%.

We can check pairwise plots, correlations and densities for the first 8 columns.

In the "salary" column we see 3 levels for salary values, and it looks like the "sales" column represents
10 departments. Let us see how uniformly data records are distributed between
salary levels and departments:

It is not very uniform, but at least there are a few hundred records for every salary type
and for each department. I would like to replace the "salary"
column with three other columns: "high_salary", "low_salary", "medium_salary". Each will have values
0 and 1, denoting "no" and "yes", respectively. In a package called "dummies" we find a function
which produces such columns from a text variable.
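To show what such columns look like, here is a small base R sketch that builds them by hand. It does not use the "dummies" package; the toy data frame and its values are made up for illustration.

```r
# Toy data frame standing in for the real data set (values are made up).
toy <- data.frame(salary = c("low", "high", "medium", "low"),
                  stringsAsFactors = FALSE)

# One 0/1 column per salary level, named "<level>_salary".
for (level in sort(unique(toy$salary))) {
  toy[[paste0(level, "_salary")]] <- as.integer(toy$salary == level)
}

# The original text column is no longer needed.
toy$salary <- NULL
toy
```

Every row has exactly one 1 among the three new columns, because each record has exactly one salary level.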

As we see our new variables may have lower correlations with our target variable "left"
than between themselves. It happens because 1) a person should have some salary, so if it is not high
or low it must be medium, 2) a person should belong to one of the departments as well. Hence
variables in these groups are correlated.

I will remove a variable in each group which yields the lowest correlation with my target variable:
"medium_salary" and "marketing".

dt$medium_salary=NULL
dt$marketing=NULL

Choosing data for work

As we remember we are supposed to answer a specific question: Why are our best
and most experienced employees leaving prematurely?
Regretfully, we are not
told whether our data set contains only such employees, so we need to check.
Let us compute the ranges of "last_evaluation" and "time_spend_company".

range(dt$last_evaluation)

## [1] 0.36 1.00

range(dt$time_spend_company)

## [1] 2 10

As we see evaluations may be rather low. Therefore our data set has other employees
which we do not need for our analysis.

I will plot
a histogram for the "last_evaluation" column and mark the median of its values with a blue
vertical line.

hist(dt$time_spend_company,col="khaki",xlab="Time Spent with Company (Years)",main="Histogram of Time Spent with Company")
abline(v=median(dt$time_spend_company),col=4,lwd=3)

median(dt$time_spend_company)

## [1] 3

We can consider people who have an evaluation above 0.72 and
have spent more than 3 years with the company as
the "best and most experienced employees". We can compute the rate of leaving for
such employees.

dt=dt[dt$last_evaluation>0.72&dt$time_spend_company>3,]
mean(dt$left)

## [1] 0.523085747

The overall rate of leaving was 23.8%; for these employees it is more than double. It is a cause for concern.

Let us now repeat the same pairwise plots with correlations we did at first.

Decision Tree with Result Explanation

So, what do we see on our tree graph? The majority of valuable employees work overtime: 70%. People
with a normal time load tend to stay, at a rate of 94%. Among those who work too many hours, some
stay if they have no more than 3.5 projects. There are a few enthusiasts who do not leave
despite too much work, and their satisfaction is above 0.71. But most of the overworked people are
leaving, and they make up 54% of all valuable workers. Even a high salary does not help. As we see, the
longer overworked people stay at the company, the more they are inclined to leave.

I want to see how reliable my result is, so I will split the data set into 2 sets:
one for training and one for testing. We will compute a decision tree model on the train set and
check what accuracy we get when predicting with the model on the test set.
To make sure that our accuracy number is not
accidental it makes sense to repeat it a few times. In practice it is done at least 10 times
(and called cross-validation),
but I will limit it to 4. I will set different random seeds
to make sure no split is repeated.

par(mfrow=c(2,2))
seed=c(89,132,765,4)
accuracy=numeric(length=4)
for(i in 1:4){
  set.seed(seed[i])
  indeces=sample(1:dim(dt)[1],round(.7*dim(dt)[1]))
  train=dt[indeces,]
  test=dt[-indeces,]
  cart_mod=rpart(left~.,data=train,maxdepth=5)
  rpart.plot(cart_mod,digits=3)
  # In this case our predictions are returned as probabilities,
  # and we need 0 and 1.
  predictionsAs0and1=sapply(predict(cart_mod,test[,-7]),function(nu)ifelse(nu>.5,1,0))
  accuracy[i]=sum(test$left==predictionsAs0and1)/length(test$left)
}

accuracy

## [1] 0.959847036 0.958891013 0.956022945 0.958891013

As you see our accuracy here fluctuates somewhere between 95.5% and 96%. The decision trees
change slightly as well, but not much.
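For comparison, here is a minimal sketch of 10-fold cross-validation, where every record is used for testing exactly once. To keep it self-contained I use simulated data and a plain `glm` model instead of the decision tree, so the accuracy it prints says nothing about our data set.

```r
set.seed(42)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x))   # simulated 0/1 outcome

k <- 10
fold <- sample(rep(1:k, length.out = n))     # random fold label for each row

accuracy <- numeric(k)
for (j in 1:k) {
  train <- toy[fold != j, ]                  # 9 folds for training
  test  <- toy[fold == j, ]                  # 1 fold held out for testing
  model <- glm(y ~ x, data = train, family = "binomial")
  predicted <- as.integer(predict(model, test, type = "response") > 0.5)
  accuracy[j] <- mean(predicted == test$y)
}
mean(accuracy)
```

The difference from the loop above is that the folds do not overlap, while independent random 70/30 splits may reuse the same records for testing.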

If we do not limit ourselves to methods which are easy to explain
to a layman, then we can get better accuracy.

Random Forest Method

For the random forest method we construct many trees, and the default option is 500 of them.
For every tree and
at each split we pick only a subset of variables. We need to choose a
size for such a subset, and in the package below it is called "mtry".
As before, to check the method's reliability
I will examine its work on different subsets of the data. In addition I
want to see what variables are important for the method, so I create
a data frame for the run number, the corresponding accuracy and the 3
most important variables.

library(randomForest)
# For classification I'm to declare the target variable as factor.
dt$left=as.factor(dt$left)
resultTable=data.frame(N=1:4,accuracy=0,
                       importance1=character(length=4L),
                       importance2=character(length=4L),
                       importance3=character(length=4L),
                       stringsAsFactors=FALSE)
for(i in 1:4){
  set.seed(seed[i])
  indeces=sample(1:dim(dt)[1],round(.7*dim(dt)[1]))
  train=dt[indeces,]
  test=dt[-indeces,]
  rf_model=randomForest(left~.,data=train,mtry=6,importance=TRUE)
  resultTable[i,"accuracy"]=sum(test$left==predict(rf_model,test[,-7]))/length(test$left)
  resultTable[i,3:5]=rownames(rf_model$importance)[1:3]
}
resultTable

We can get accuracy of 98.5%-99% by choosing "mtry=6". The most significant variable
is "satisfaction_level", then "last_evaluation" and "number_project". As we saw previously, the first
variable is responsible for the majority of leaving workers.
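A word of caution about reading importance: `rownames(rf_model$importance)[1:3]` returns the first three rows as they are stored, which follows the order of the data columns. To rank variables, the matrix has to be sorted by one of its measures. The sketch below shows the idea on a toy matrix shaped like the importance matrix of a classification forest; the variable names and numbers are invented for illustration.

```r
# Toy matrix shaped like rf_model$importance; names and numbers are invented.
imp <- matrix(c(0.02, 0.15, 0.07, 0.01,
                10.5, 80.2, 35.1, 4.3),
              ncol = 2,
              dimnames = list(c("var_a", "var_b", "var_c", "var_d"),
                              c("MeanDecreaseAccuracy", "MeanDecreaseGini")))

# First three row names: storage order, not a ranking.
rownames(imp)[1:3]

# Three most important variables according to one chosen measure.
ranked <- imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ]
top3 <- rownames(ranked)[1:3]
top3
```

Which measure to sort by (accuracy decrease or Gini decrease) is a choice, and the two rankings do not always agree.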

Linear and Quadratic Discriminant Analysis

These methods require some specific properties of data. Our variables
are supposed to have a normal distribution, and by the look of their
histograms they do not. Thus the methods are not likely to perform well,
and that is what we see as a result.

Nearest Neighbors

With nearest neighbors we consider each record as a point in multidimensional space
and check for each test value
whether it is close to some values in the train set. For this we need the values of
each variable to be approximately on the same scale. We can look at our plot and see that this
is not the case. The majority of our data values are 0 and 1, but some are not.
Let us see it in more detail:
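One common way to bring variables to the same scale is min-max rescaling, sketched below in base R. The toy data frame and the helper `rescale01` are made up for illustration; this is not necessarily the transformation used for the results in this post.

```r
# Toy data (made-up values): one 0/1 column and one on a much larger scale.
toy <- data.frame(flag  = c(0, 1, 1, 0),
                  hours = c(150, 280, 200, 310))

# Min-max rescaling: every column ends up between 0 and 1.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
scaled <- as.data.frame(lapply(toy, rescale01))
scaled
```

After rescaling, a difference in "hours" no longer dominates the distance between two records.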

It is the worst. It means that our data do not form two
clusters which we can mark as "left" and "stayed".
You can try other kernels, which represent different kinds of
borders, and verify that it is no help. I am not going to check
out the resulting accuracy either.

Logistic Regression

Now let us use logistic regression. It is not likely to produce a good result, because
for it to work properly a number of assumptions must hold. Still we can check it out.

seed=c(89,132,765,4)
accuracy=numeric(length=4)
for(i in 1:4){
  set.seed(seed[i])
  indeces=sample(1:dim(dt)[1],round(.7*dim(dt)[1]))
  train=dt[indeces,]
  test=dt[-indeces,]
  log_reg_model=glm(left~.,data=train,family="binomial")
  ## Values for logistic regression predictions are
  ## not limited to 0 and 1.
  predictionsAs0and1=sapply(predict(log_reg_model,test[,-7]),function(nu)ifelse(nu>0,1,0))
  accuracy[i]=sum(test$left==predictionsAs0and1)/length(test$left)
}
accuracy

## [1] 0.860420650 0.869980880 0.855640535 0.875717017

It is not one of the best either. So our variables are not a very good fit for
a linear additive model. Sometimes 85% is all you can get from data, but
here we've seen better accuracy.
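A note on the threshold used in the code above: by default `predict` for a `glm` model returns values on the link scale (log-odds), so cutting at 0 is the same as cutting probabilities at 0.5. Here is a small check on simulated data; the data and coefficients are made up.

```r
set.seed(7)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(2 * toy$x))           # simulated 0/1 outcome

model <- glm(y ~ x, data = toy, family = "binomial")

log_odds <- predict(model, toy)                      # default: type = "link"
probs    <- predict(model, toy, type = "response")   # probabilities in (0, 1)

# The sigmoid (plogis) converts log-odds to probabilities ...
all.equal(plogis(log_odds), probs)

# ... so both thresholds should produce the same 0/1 labels.
all((log_odds > 0) == (probs > 0.5))
```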

Conclusion

The Random Forest method yields the best accuracy, although it does not explain
much. The Decision Tree is much more useful for staff policy recommendations.
The other methods mostly tell us what our data are not: we cannot divide records
into 2 well-defined clusters which we can name "left" and "stayed",
the variables are not multinormally distributed, and a linear additive model does not work well
for predicting the desired outcome.

Remark: I would like to note that I used simple variations of the methods and did not explore all options.

As you see I have a lot of repetitions in my code. There is a way to avoid it: look at
"caret" package!

Wednesday, November 30, 2016

I continue doing my ML work with the MNIST set, currently presented at Kaggle as the Digit Recognizer competition, which I’ve started in one of my previous posts. This time I’ve decided to try the neural network method. At first I took a look at the “nnet” and “neuralnet” packages, but they could not handle such a big set. Not only were memory and timing a problem, but there are default restrictions, like only one hidden layer and a limited number of nodes. The number of nodes may be increased, but I got memory overload.

I googled if there is anything new for NN with R. Found two frameworks, MXNET and H2O. Decided to try H2O first, because it looked simpler.

Neither package can be installed using the usual R command “install.packages”. For H2O you can find installation instructions on the company web site. You may get a message that a Java installation is required. MXNET installation instructions are more complicated and you may need to adapt what you google. I posted my story with it in my blog post here.

Now let us start predicting! At first we load the data set, check dimensions and prepare the target variable.

Because I’m working with a Kaggle set I’m supposed to submit my prediction for the test set on their site. For this I need to load it, convert it to H2O format as well, make predictions and convert the results back to R format. Afterwards I will shut down the H2O instance and write a submission file.

I’m making this markdown file with RStudio, and it means that at first I need to go back to the directory where all my data are stored.