On Jan 9, 5:34 am, "David Jones" <dajh...@hotmail.co.uk> wrote:
> "Ray Koopman" wrote in message
> news:fda62fa1-7a51-456e-84a7-836d994d070e@p7g2000pbz.googlegroups.com...
>
> On Jan 8, 11:56 pm, Paul <paul.domas...@gmail.com> wrote:
>
> > I did a simple linear regression (SLR) on 2 equal-length vectors of
> > data, then subjected the residuals to a Normal Probability Plot
> > (NPP, http://en.wikipedia.org/wiki/Normal_probability_plot). The fit was
> > good, and there was no gross concavity, convexity, or "S" shape to
> > indicate skew or excess kurtosis.
>
> > From web browsing, I found that the mean and the standard deviation of
> > the normal distribution that is being tested for can be estimated by
> > the y-intercept and the slope of the NPP. In other words, a 2nd SLR
> > is performed on the residuals NPP scatter graph.
>
> > I am at a loss as to how to resolve a discrepancy. The estimate of
> > *standard deviation* of the residuals comes from slope of the NPP.
> > This should correspond to the standard error of the estimate from the
> > SLR on the 2 vectors of data. Shouldn't it?? It doesn't. There is a
> > notable error of about 5% (N=16 data points, yes I know it's a small
> > sample). I don't know which is the correct result. I am using
> > Excel's linear regression LINEST, but as will be clear, that's not all
> > that relevant to the problem since I can read their documentation to
> > ensure that they conform with textbook theory.
>
> > Part of this discrepancy can be explained by the fact that the formula
> > for the normal order statistic means is approximate (see the NPP
> > wikipedia page), but I suspect that it's not the main culprit because
> > the estimated *mean* of the residuals was highly accurate (in the
> > order of 1e-17, ideally zero).
>
> > I decided to manually calculate the estimate of the standard deviation
> > for residuals. This is simply the sum of the square (SS) of the
> > residuals (SSres), normalized by the DF, then square-rooted.
> > According to the SLR theory, the DF should be N-2 because one degree
> > is in getting the mean of the independent variable, and another is
> > lost in getting the mean of the dependent variable. I manually
> > verified that this is in fact what is done by Excel's SLR.
>
> > The alternative is to look at the NPP problem completely separately
> > from SLR problem. This is simply estimating a population standard
> > deviation from a sample. The sample consists for the residuals from
> > the SLR problem, this fact is not used. The DF in such a process is
> > N-1.
>
> > For the estimation of standard deviation for the residuals (not just
> > for the sample, but for the whole hypothetical population), which DF
> > is theoretically correct, N-1 or N-2? As a disclaimer, I should say
> > that using N-1 gives a greater discrepancy from the SLR than even the
> > NPP yields. So it doesn't really help dispel the discrepancy. Be
> > that as it may, however, I'm still interested in what is the
> > theoretically correct choice for DF.
>
> > P.S. I'm not interested in the Maximum Likelihood approach for the
> > time being. Better to get a good understanding of the why's for one
> > approach before broaching another approach.
>
> Even if the true regression is linear and the error random
> variables are iid normal, the sample residuals are not iid.
> Their joint distribution is n-variate normal with zero means
> and covariance matrix = (I - H)*sigma^2, where sigma^2 is the
> variance of the error distribution and H is the "hat matrix"
> associated with the predictors. The expected order statistics
> of the residuals are not a simple linear function of the
> expected order statistics of n iid normals.
>
> ------------------------------------------------------------------------
>
> A way of understanding/improving the NPP approach in a non-regression
> context is to look at L-Estimation ... the use of linear combinations of
> order statistics to estimate the parameters of distributions. The slope of
> the NPP line is simply one particular linear combination of order
> statistics. There is some theory to specify "optimal" estimates. As above,
> the theory for this would not be easily transferred for practical use in the
> more complicated regression-residual case

Ray, David,

Thank you both for your explanations. I do understand the statement
that the residuals of the 1st SLR are not IID even though the error
distributions are. To fully understand the underlying reasons, I
think I need to delve much more into the theory. For the time being,
I take this as meaning that using SLR on the NPP is quite the
approximation. Can I also take it as meaning that the answer to
choosing df=N-1 or df=N-2 is not a simple answer (and may not even
make sense), or is there actually a rationale for choosing the lesser
of the 2 evils?
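
For anyone who wants to reproduce the comparison outside Excel, here is a
minimal Python sketch of the three estimates discussed in this thread: the
SLR standard error of the estimate (df=N-2), the plain sample standard
deviation of the residuals (df=N-1), and the slope of a 2nd SLR on the NPP.
It is only an illustration under stated assumptions. The function name
compare_sd_estimates is made up for this post, and the normal order
statistic means are approximated with Blom's plotting positions
(i - 0.375)/(N + 0.25), which may not be the formula used in the original
spreadsheet. It also forms the hat matrix H explicitly so that the trace of
(I - H) can be inspected.

```python
import numpy as np
from statistics import NormalDist


def compare_sd_estimates(x, y):
    """Return three estimates of the error SD for a simple linear regression.

    Hypothetical helper for illustration only; not taken from any library.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size

    # 1st SLR: design matrix with an intercept column and the predictor.
    X = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = float(resid @ resid)

    # Hat matrix H = X (X'X)^-1 X'. The residual covariance is (I - H)*sigma^2,
    # so the expected value of SSres is trace(I - H)*sigma^2.
    H = X @ np.linalg.solve(X.T @ X, X.T)
    trace_i_minus_h = float(np.trace(np.eye(n) - H))

    # Approximate normal order statistic means (assumption: Blom's formula).
    p = (np.arange(1, n + 1) - 0.375) / (n + 0.25)
    z = np.array([NormalDist().inv_cdf(pi) for pi in p])

    # 2nd SLR: sorted residuals regressed on the approximate normal scores.
    npp_slope, npp_intercept = np.polyfit(z, np.sort(resid), 1)

    return {
        "sd_df_n_minus_2": (ss_res / (n - 2)) ** 0.5,  # SLR standard error
        "sd_df_n_minus_1": (ss_res / (n - 1)) ** 0.5,  # plain sample SD
        "npp_slope": float(npp_slope),
        "npp_intercept": float(npp_intercept),
        "trace_I_minus_H": trace_i_minus_h,
    }


if __name__ == "__main__":
    # Synthetic data (not the data from the original post), N=16.
    rng = np.random.default_rng(0)
    x = np.arange(16.0)
    y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=16)
    for name, value in compare_sd_estimates(x, y).items():
        print(f"{name}: {value:.6g}")
```

For an SLR with an intercept, trace(I - H) comes out as N-2, which is the
usual reason for dividing SSres by N-2: it makes SSres/(N-2) an unbiased
estimate of sigma^2. The NPP slope is a different linear combination of the
sorted residuals, so there is no reason to expect it to match either
divisor exactly.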