Why use the sigmoid function when it becomes 1 for fairly small positive numbers (and, symmetrically, 0 for negative ones)?
While computing the cost I was getting Inf values. This happens because x %*% theta produces positive numbers, and the sigmoid of the hypothesis, h = sig(x %*% theta), maps them to 1. That 1 then breaks the log(1 - h) term of the cost function. And these positive numbers do not even have to be large: sig(20) already displays as 1 with options(digits = 7).
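To make the failure concrete, here is a minimal reproduction. The question uses R, but the floating-point behaviour is identical in Python/NumPy, so this sketch uses that instead:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(round(sigmoid(20.0), 7))      # displays as 1.0, like options(digits = 7) in R
print(sigmoid(20.0) == 1.0)         # False: stored value is 0.99999999793..., not 1
print(sigmoid(40.0) == 1.0)         # True: around z ~ 37 it rounds to exactly 1.0 in double precision
print(np.log(1.0 - sigmoid(40.0)))  # -inf, which is what makes the cost Inf
```

So sig(20) is only *printed* as 1 at 7 digits, but for somewhat larger z the value really does become exactly 1.0 in double precision, and log(1 - h) genuinely blows up.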

For problem 1, I found a solution here, which states to standardize the data. While this seems to work in my case, intuitively, isn't it just going to fail for some outlier? This outlier could be in the training set, or even in the unseen test data (whose mean and sd I haven't used for standardization). So is standardization or normalization really an ideal solution for this problem?

Also, when standardizing or normalizing, the independent variable corresponding to theta_zero in the design matrix X becomes 0, which leads to theta_zero always coming out as 0. Isn't that inefficient?

1 Answer

Why use the sigmoid function when it becomes 1 for fairly small positive numbers (and, symmetrically, 0 for negative ones)?

No matter how big or small the numbers in the design matrix are, the final predicted outcome should be either 1 or 0 when the classification in question is binary. This is the basic intuition behind using the logistic function as the squashing function: if the target values are binary but the predicted values are unbounded, the generated values would serve no purpose. As it stands, the value produced by the logistic function gives the probability of belonging to the positive class, in accordance with the maximum likelihood estimate.
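The squashing behaviour described above can be sketched in a few lines (Python/NumPy here rather than the question's R, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Unbounded scores such as x %*% theta get squashed into probabilities.
z = np.array([-30.0, -2.0, 0.0, 2.0, 30.0])
p = sigmoid(z)
print(p)            # values strictly between 0 and 1, with sigmoid(0) = 0.5
print(p >= 0.5)     # thresholding at 0.5 gives the binary class prediction
```

Each output can be read as the estimated probability of the positive class, which is what makes the logistic function a natural fit for binary targets.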

So is standardisation or normalisation really an ideal solution for this problem?

If we consider any machine learning algorithm, it will not accommodate every outlier. There is always a chance of failure in the inferences drawn from the training data, so the intrinsic robustness of a model is, up to a degree, dependent on the training regime. So yes, normalisation is an ideal solution in this case, but it does not guarantee immunity from an unseen rogue outlier.

Normalisation and feature scaling also bring further benefits, such as removing skewness among the dimensions, which improves convergence when using an algorithm like gradient descent (reference).
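Regarding the worry about unseen test data: the standard practice is to compute the mean and standard deviation on the training set only and reuse them for the test set. A minimal sketch (Python/NumPy, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(100, 2))  # raw features on a large scale
X_test = rng.normal(loc=50.0, scale=10.0, size=(20, 2))

# Standardize using statistics of the TRAINING data only,
# then apply the same transform to the unseen test data.
mu = X_train.mean(axis=0)
sd = X_train.std(axis=0)
X_train_std = (X_train - mu) / sd
X_test_std = (X_test - mu) / sd

print(X_train_std.mean(axis=0))  # ~0 for each feature
print(X_train_std.std(axis=0))   # ~1 for each feature
```

A test-set outlier simply maps to a large z-score under this transform; it may still hurt accuracy, but it does not break the scaling itself.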

Also, when standardizing or normalizing, the independent variable corresponding to theta_zero in the design matrix X becomes 0, which leads to theta_zero always coming out as 0. Isn't that inefficient?

The assumption that theta_zero is set to zero is incorrect. If we look at the two curve fits here, both of them (the first not normalized, the second normalized) end up with a non-zero theta_zero.
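This is easy to verify numerically: only the feature columns are standardized, the column of ones for the intercept is left alone, so theta_zero stays a free parameter. A small least-squares sketch (Python/NumPy, synthetic data, linear rather than logistic regression since the point is the same):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(10.0, 2.0, size=200)
y = 5.0 + 3.0 * x + rng.normal(0.0, 0.1, size=200)

# Standardize the feature column only; the intercept column of ones
# in the design matrix is left untouched.
x_std = (x - x.mean()) / x.std()
X = np.column_stack([np.ones_like(x_std), x_std])

theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # theta_zero comes out approximately mean(y), clearly non-zero
```

With a centered feature, theta_zero absorbs the mean of y, so it is typically far from zero after standardization.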

$\begingroup$As per the cost function, if the hypothesis predicts 1, log(1-h) turns into log(0), hence the error I am getting. What workarounds do modern machine learning libraries implement for that?$\endgroup$
– Mohit, Nov 25 '17 at 20:37

$\begingroup$The hypothesis would not predict the value 1 if the features are scaled appropriately. Mathematically, h(z) = 1 / (1 + e^(-z)) where z = Ax + b; h(z) = 1 would require e^(-z) = 0, which has no solution. Hence the problem would be solved.$\endgroup$
– m1cro1ce, Nov 25 '17 at 21:40
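On the workaround question from the comments: a common trick (used by the "from logits" loss variants in several libraries) is to compute the loss directly from z = x %*% theta via the identity -log(1 - sigmoid(z)) = log(1 + e^z), evaluated stably, instead of passing through the sigmoid output. A Python/NumPy sketch (the specific function names here are illustrative, not from any particular library):

```python
import numpy as np

def bce_naive(z, y):
    """Naive binary cross-entropy: returns inf once sigmoid(z) rounds to exactly 1."""
    h = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

def bce_stable(z, y):
    """Loss computed from the logit z directly:
    -log sigmoid(z) = log(1 + e^(-z)) and -log(1 - sigmoid(z)) = log(1 + e^z).
    np.logaddexp(0, t) evaluates log(1 + e^t) without overflow or log(0)."""
    return y * np.logaddexp(0.0, -z) + (1 - y) * np.logaddexp(0.0, z)

z = 40.0  # sigmoid(40) already rounds to exactly 1.0 in double precision
print(bce_naive(z, 0))   # inf, because log(1 - 1) = log(0)
print(bce_stable(z, 0))  # ~40.0, the correct finite loss
```

Some libraries instead simply clip h into something like [eps, 1 - eps] before taking logs; the logit-based form above avoids even that approximation.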