On Mon, Nov 2, 2009 at 9:53 AM, Souheil Inati <souheil.inati@nyu.edu> wrote:
>> On Nov 2, 2009, at 4:37 AM, Sturla Molden wrote:
>>>josef.pktd@gmail.com skrev:
>>> Good, I didn't realize this when I worked on the eig and svd
>>> versions of
>>> the pca. In a similar way, I was initially puzzled that pinv can be
>>> used
>>> on the data matrix or on the covariance matrix (only the latter I
>>> have seen
>>> in books).
>>>>>>>>>> I'll try to explain... If you have a matrix C, you can factorize like
>> this, with Sigma being a diagonal matrix:
>>>> C = U * Sigma * V'
>>>>>>> u,s,vt = np.linalg.svd(c)
>>>> If C is square (rank n x n), we now have the inverse
>>>> C**-1 = V * [S**-1] * U'
>>>>>>> c_inv = np.mat(vt.T) * np.mat(np.eye(4)/s) * np.mat(u.T)
>>>> And here you have the pathology diagnosis:
>>>> A small value of s, will cause a huge value of 1/s. This is
>> "ill-conditioning" that e.g. happens with multicolinearity. You get a
>> small s, you divide by it, and rounding error skyrockets. We can
>> improve
>> the situation by editing the tiny values in Sigma to zero. That just
>> changes C by a tiny amount, but might have a dramatic stabilizing
>> effect
>> on C**-1. Now you can do your LU and not worry. It might not be clear
>> from statistics textbooks why multicolinearity is problem. But using
>> SVD, we see both the problem and the solution very clearly: A small
>> singular value might not contribute significantly to C, but could or
>> severly affect or even dominate in C**-1. We can thus get a biased but
>> numerically better approximation to C**-1 by deleting it from the
>> equation. So after editing s, we could e.g. do:
>>>>>>> c_fixed = np.mat(u) * np.mat(np.eye(4)*s) * np.mat(vt)
>>>> and continue with LU on c_fixed to get the quadratic form.
>>>> Also beware that you can solve
>>>> C * x = b
>>>> like this
>>>> x = (V * [S**-1]) * (U' * b)
>>>> But if we are to reapeat this for several values of b, it would make
>> more sence to reconstruct C and go for the LU. Soving with LU also
>> involves two matrix multiplications:
>>>> L * y = b
>> U * x = y
>>>> but the computational demand is reduced by the triangular structure
>> of L
>> and U.
>>>> Please don't say you'd rather preprocess data with a PCA. If C was a
>> covariance matrix, we just threw out the smallest principal components
>> out of the data. Deleting tiny singular values is in fact why PCA
>> helps!
>>>> Also beware that
>>>> pca = lambda x : np.linalg.svd(x-x.mean(axis=0), full_matrices=0)
>>>> So we can get PCA from SVD without even calculating the covariance.
>> Now
>> you have the standard deviations in Sigma, the principal components in
>> V, and the factor loadings in U. SVD is how PCA is usually computed.
>> It
>> is better than estimating Cov(X), and then apply Jacobi rotations to
>> get
>> the eigenvalues and eigenvectors of Cov(X). One reason is that Cov(X)
>> should be estimated using a "two-pass algorithm" to cancel
>> accumulating
>> rounding error (Am Stat, 37: p. 242-247). But that equation is not
>> shown in most statistics textbooks, so most practitioners tend to not
>> know of it .
>> We can solve the common least squares problem using an SVD:
>>>> b = argmin { || X * b - Y || ** 2 }
>>>> If we do an SVD of X, we can compute
>>>> b = sum( ((u[i,:] * Y )/s[i]) * vt[:,i].T )
>>>> Unlike the other methods of fitting least squares, this one cannot
>> fail.
>> And you also see clearly what a PCA will do:
>>>> Skip "(u[i,:] * Y )/s[i]" for too small values of s[i]
>>>> So you can preprocess with PCA anf fit LS in one shot.
>>>> Ridge-regression (Tychonov regularization) is another solution to the
>> multicollinearity problem:
>>>> (A'A + lambda*I)*x = A'b
>>>> But how would you choose the numerically optimal value of lambda? It
>> turns out to be a case of SVD as well. Goloub & van Loan has that on
>> page 583.
>>>> QR with column pivoting can be seen as a case of SVD. Many use this
>> for
>> least-squares, not even knowing it is SVD. So SVD is ubiquitous in
>> data
>> modelling, even if you don't know it. :-)
>>>> One more thing: The Cholesky factorization is always stabile, the LU
>> is
>> not. But don't be fooled: This only applies to the facotization
>> itself.
>> If you have multicolinearity, the problem is there even if you use
>> Cholesky. You get the "singular value disease" (astronomic rounding
>> error) when you solve the triangular system. A Cholesky can tell you
>> if
>> a covariance matrix is singular at your numerical precision. An SVD
>> can
>> tell you how close to singularity it is, and how to fix it. SVD
>> comes at
>> a cost, which is slower computation. But usually it is worth the
>> extra
>> investment in CPU cycles.
>>>> Sturla Molden
Thanks Sturla,
I'm going to slowly work my way through this. at least I'm able now to
calculate a inverse matrix squareroot, which is useful for the quadratic
form and we will switch away from some of the remaining uses of the
matrix inverse in statsmodels.
>>> I agree with Sturla's comment's above 100%. You should almost
> always use SVD to understand your linear system properties. For
> least squares fitting QR is the modern, stable algorithm of choise.
> (see for example the matlab \ operator). It's really a crime that we
> don't teach SVD and QR.
>> There are two sources of error: 1. noise in the measurement and 2.
> noise in the numerics (rounding, division, etc.). A properly
> constructed linear system solver will take care of the second type of
> error (rounding, etc.). If your system is ill-conditioned, then you
> need to control the inversion so that the signal is maintained and the
> noise is not amplified too much. In the overwhelming majority of
> applications, the SNR isn't better than 1000:1. If you know your the
> relative size of your noise and signal, then you can control the SNR
> in your parameter estimates by choosing the svd truncation (noise
> amplification factor).
>> For those of you that want an accessible reference for numerical
> stability in linear algebra, this book is a must read:
> Numerical Linear Algebra, Lloyd Trefethen
>http://www.amazon.com/Numerical-Linear-Algebra-Lloyd-Trefethen/dp/0898713617
It really depends on the application. From the applications I know,
pca is used for dimension reduction, when there are way too many
regressors to avoid overfitting. The most popular in econometrics
might be
Forecasting Using Principal Components from a Large Number of Predictors
# James H. Stock and Mark W. Watson
# Journal of the American Statistical Association, Vol. 97, No. 460
(Dec., 2002), pp. 1167-1179
A similar problem exists in chemometrics with more regressors than
observations (at least from the descriptions I read when reading about
NIPALS).
I don't think that compared to the big stochastic errors, numerical
precision plays much of a role. When we have large estimation errors
in small samples in statistics, we don't have to worry, for example,
about 10e-15 precision when our sampling errors are 10e-1.
Of course, there are other applications, and I'm working my way slowly
through the numerical issues.
Josef
>> Cheers,
> Souheil
>> ---------------------------------
>> Souheil Inati, PhD
> Research Associate Professor
> Center for Neural Science and Department of Psychology
> Chief Physicist, NYU Center for Brain Imaging
> New York University
> 4 Washington Place, Room 809
> New York, N.Y., 10003-6621
> Office: (212) 998-3741
> Fax: (212) 995-4011
> Email: souheil.inati@nyu.edu>> _______________________________________________
> SciPy-User mailing list
>SciPy-User@scipy.org>http://mail.scipy.org/mailman/listinfo/scipy-user>