$\begingroup$Yes; whatever minimizes the sum of squared residuals also minimizes the mean (average) of the squared residuals (better wording than the mean of the sum of the squared residuals, as there is just one sum, and the mean of that sum is just that sum). But no-one, so far as I know, claims that it is necessary to treat the sum of squared residuals as the only minimand, for this reason if not others. Indeed in simple linear regression we never minimize either quantity directly.$\endgroup$
– Nick Cox, Feb 12 '16 at 11:06

$\begingroup$@Nick Cox - Surely the sum is a different number from the mean? (Excuse my ignorance.) The mean of 2 and 1 is 1.5, yet the sum is 3.$\endgroup$
– Conor Cosnett, Feb 12 '16 at 11:11

$\begingroup$Also, what do you mean when you say "minimise either quantity directly"?$\endgroup$
– Conor Cosnett, Feb 12 '16 at 11:20


$\begingroup$Sure; but the point is that this doesn't bite. You are just changing the units in which you work, but the position of the minimum in parameter space is the same. The problem is analogous to this. Which $x$ minimises $(x - 2)^2$? Clearly $2$. Now which $x$ minimises $k (x - 2)^2$ for any positive $k$? It's the same answer. Otherwise put, a positive multiplicative constant doesn't change the shape of a surface.$\endgroup$
– Nick Cox, Feb 12 '16 at 11:20
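The invariance described in the comment above can be checked numerically. This is an illustrative sketch (the grid and objective are my own choice, mirroring the $(x-2)^2$ example): scaling an objective by any positive constant leaves the location of its minimum unchanged.

```python
import numpy as np

# Grid of candidate x values from -10.0 to 10.0 in steps of 0.1.
x = np.arange(-100, 101) / 10.0
f = (x - 2) ** 2

best = x[np.argmin(f)]  # the minimiser of (x - 2)^2, i.e. 2.0

# Multiplying the objective by a positive constant k (e.g. 1/n, turning a
# sum of squared residuals into a mean) does not move the argmin.
for k in (0.5, 3.0, 100.0):
    assert x[np.argmin(k * f)] == best

print(best)
```

The same argument applies to the residual sum of squares: dividing by $n$ rescales the surface but does not move the minimising coefficients.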


$\begingroup$You might need to ask him why he explains it this way and/or read more of his materials. There could be any number of good reasons. One that I regard as standard is that the residual variance is more interesting and useful than the sum of squared residuals (SSR), which clearly depends on how many observations there are. Thus the SSR for 200 observations will, other things being equal, be about twice that for 100 observations, but in comparing fits that's secondary. In fact, it's for the same reason that the sums of 200 human heights or of 100 are less useful than their means.$\endgroup$
– Nick Cox, Feb 12 '16 at 11:43
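The scaling point in the comment above is easy to demonstrate with simulated residuals (a hypothetical setup of my own, using iid noise with standard deviation 0.5; larger sample sizes than in the comment are used only to make the ratio stable): the SSR grows roughly linearly with the number of observations, while the mean of the squared residuals stays near the noise variance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical residuals from two fits: iid N(0, 0.5^2) noise.
r_small = rng.normal(0, 0.5, 1000)
r_large = rng.normal(0, 0.5, 2000)

ssr_small = (r_small ** 2).sum()
ssr_large = (r_large ** 2).sum()

mse_small = (r_small ** 2).mean()
mse_large = (r_large ** 2).mean()

# Twice the data gives roughly twice the SSR, other things being equal ...
print(ssr_large / ssr_small)
# ... while the mean squared residual stays near sigma^2 = 0.25 for both.
print(mse_small, mse_large)
```

This is why the mean (or the residual variance) is the more comparable summary across data sets of different sizes.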

1 Answer

The reason is that if you do not take the mean, then in the gradient descent algorithm you would in most cases need to choose a very small learning rate (like 0.00000001), and that rate would need to be retuned depending on the number of examples in the data set. When you take the mean, a learning rate around 0.01 will usually work. This way you work with a reasonably sized learning rate whose magnitude is independent of the size of the data set.