Looking at the log that the link interface dumps, it shows that the Matlab svd command is now calling LAPACK sgesdd instead of sgesvd. Looking at the supported LAPACK SVD functions in CULA, it seems that sgesdd is not one of the supported/implemented routines, so it makes sense that the link interface just falls back to the Intel MKL BLAS/LAPACK for sgesdd (on the CPU only -- no speedup using a GPU for Matlab svd, for now).

It seems that sgesdd is not even supported in CULA Premium, which is especially disappointing.

Has anyone else noticed this problem? Is anyone using CULA link with MATLAB?

Given that sgesvd is slower than sgesdd, I'm also noticing that the user-defined MEX routines (also mentioned on the above interfaces blog page) are not faster than the Matlab svd command. That is, a culaSvd(A) MEX/CULA routine is not faster than the default Matlab svd(A) (with the link interface not used).

So, there is no version of CULA (link interface or user MEX call to CULA routines) that I know of that is faster than Matlab's (version 2011a) own built-in svd (running on 8 cores on a dual Xeon). Even with high-end GPU cards (I tried both a Tesla 2050 and a GTX 580). Is this really true, or am I doing something wrong?

Thanks for the quick feedback -- but I do think sgesdd is being called in 2011a:

I don't have access to my Windows 7 x64 machine running Matlab 2011a, but I just checked and I get the same result on my Mac OS X (Snow Leopard) machine at home, also running Matlab 2011a. So here are some details (which look identical to what I get on Windows):

Using the CULA link interface (with the latest CULA R12 Free, for now) when starting up Matlab:

The log dumped by the CULA link interface indicates that svd(A) with no nargouts does call sgesvd, but that [u,s,v]=svd(A) calls sgesdd instead (this might be something MathWorks changed very recently -- I think sgesdd is supposed to be faster than sgesvd in many cases, so they probably got wise -- and so CULA needs to provide an sgesdd to keep up):

>> randn('seed',1); A=randn(2000,2000,'single');
>> tic;svd(A);toc
Elapsed time is 3.238710 seconds.
>> tic;svd(A);toc
Elapsed time is 3.228221 seconds.
>> tic;[u,s,v]=svd(A);toc
Elapsed time is 5.330816 seconds.
>> tic;[u,s,v]=svd(A);toc
Elapsed time is 5.363748 seconds.

So, the GPU (a 330M, in a 2010 MBP) [for the nargout=0 call to svd, the only case which actually runs CULA GPU code] gets 2.98 s, whereas the CPU (dual-core -- both cores go to 100%) gets 3.2 s -- they are about the same speed, which I think is about right, given the peak GFLOPS of the 330M. [On my Windows 8-core dual-Xeon machine, the CULA svd (with nargout=0) running on the GPU (most recently I was testing a GTX 560 Ti, for which CUBLAS SGEMM gets me about 500 GFLOPS) is about 3x faster than Matlab's built-in svd.]

This same result occurs for other matrix sizes, on both Windows 7 x64 and OSX -- so I think it is the case that Matlab 2011a *does* now call sgesdd when outputs ARE requested.
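To see the two LAPACK drivers side by side outside of Matlab and CULA, here is a small sketch using SciPy, whose scipy.linalg.svd exposes the driver choice directly (Python rather than Matlab purely so both drivers can be called explicitly; the matrix size and tolerances below are my own illustrative picks, not anything from the thread):

```python
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 500)).astype(np.float32)

# QR-iteration driver (sgesvd) -- the routine svd(A) with no outputs calls
s_qr = svd(A, compute_uv=False, lapack_driver='gesvd')

# divide-and-conquer driver (sgesdd) -- the routine [u,s,v]=svd(A) now calls
U, s_dc, Vt = svd(A, lapack_driver='gesdd')

# both drivers compute the same singular values (to single precision)
assert np.allclose(s_qr, s_dc, rtol=1e-3)

# and the divide-and-conquer factorization reconstructs A
assert np.allclose(U @ np.diag(s_dc) @ Vt, A, atol=1e-2)
```

The numerical results agree; the drivers differ only in how the bidiagonal stage is diagonalized, which is where the speed difference (and the CULA support gap) comes from.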

Hopefully sgesdd is something that CULA can add in the next release, soon -- I would love a good excuse to buy CULA Premium, and then dgesdd as well ... [and then hopefully in the near future see CULA get some SpMV routines as well ...]

BTW, I also just now tried Kyle's [~,s,~]=svd(A) example (I never realized the convenient "~" output syntax existed in Matlab ...), but the CULA log indicates that in that case Matlab's svd also calls sgesdd().

In "SVDD", the 1st and 2nd steps are identical to the "normal" SVD. However, more parallelism can be extracted from the 3rd step because, depending on the data, the problem can be broken into independent sub-problems.

So, the point is that we have the majority of the work done. We'll look into the work required for implementing the "divide-and-conquer" portion of step 3.

Also, the scaling of SVD is fairly poor until larger sizes (over 4k) are reached. Below that, memory-bound routines like matrix-vector products dominate the total runtime. The speed-ups at the largest sizes are obtained because compute-bound routines like matrix-matrix products begin to dominate the total runtime.
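As a rough illustration of that size dependence, here is a CPU-only SciPy timing sketch (not CULA, and the sizes are my own arbitrary picks): full-SVD runtime grows roughly cubically with n, which is why the compute-bound work only starts to pay off on a GPU at large sizes:

```python
import time
import numpy as np
from scipy.linalg import svd

rng = np.random.default_rng(0)
times = []
for n in (256, 512, 1024):
    A = rng.standard_normal((n, n)).astype(np.float32)
    t0 = time.perf_counter()
    svd(A, lapack_driver='gesdd')  # full divide-and-conquer SVD
    times.append(time.perf_counter() - t0)
    print(f"n={n:5d}  sgesdd time: {times[-1]:.4f} s")
```

Each doubling of n multiplies the flop count by about 8x, so the largest size dominates the total time even in this small range.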

It's an interesting design decision from Mathworks, since they have made a change which will result in the users observing different behavior (and possibly different quality of result) from version to version. I'd have preferred a flag or a different routine name, myself, but I guess the routine is called "SVD" not "SVD via GESVD." Interesting find, thanks for writing in.

Hi Kyle, I am wondering if the CULA team has any future plan to provide an SVDS function like Matlab's, which performs an SVD but for a selected number of singular values? So far I have not found any, and it is really important for us to use this instead of the complete SVD. Thank you in advance. ---JiFeng

To my knowledge, there is no LAPACK equivalent of this. In my testing, at least in Matlab, it's faster to run the full SVD and then cut the U, S, V matrices down to the number of values you want than it is to run the SVDS command.
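Here is a hedged SciPy sketch of that comparison (scipy.sparse.linalg.svds plays the role of Matlab's svds here; the matrix size and k are arbitrary picks): truncating a full SVD and running an iterative truncated SVD agree on the leading singular values, and which one is faster depends on the matrix size and on k:

```python
import numpy as np
from scipy.linalg import svd
from scipy.sparse.linalg import svds

rng = np.random.default_rng(2)
A = rng.standard_normal((300, 300))
k = 10  # number of singular values/vectors wanted

# full SVD, then truncate to the leading k triplets
U, s, Vt = svd(A, full_matrices=False)
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# iterative truncated SVD (ARPACK), analogous to Matlab's svds
U2, s2, Vt2 = svds(A, k=k)
s2 = s2[::-1]  # svds returns singular values in ascending order

# both approaches find the same leading singular values
assert np.allclose(sk, s2, rtol=1e-6)
```

For small k on a large matrix the iterative route only touches the matrix through matrix-vector products, which is exactly the regime JiFeng is describing below.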

Hi John, thank you for your concern and quick reply. Probably you are talking about the SVD for singular values only, or for small matrix sizes (say, 100x100); yes, in that case the full SVD is quicker than SVDS. But if you also need U and V and the matrix size is a little larger, then the full SVD becomes much slower. That's also why we really do need SVDS. What do you think?