I checked this by calculating norm(my_sum_for, my_sum) = 1.7861e-10. It is not new to me that vectorized and looped versions of the same code produce slightly different results.
– N0rbert, Sep 6 '12 at 11:36

Are the actual contents of C really just ones? And there are some repeated elements in the matrix k*z whose tangent you might be able to skip computing, but given the values of K and n_Z there's not much to save there.
– Rody Oldenhuis, Sep 6 '12 at 14:33

It is an example; C is an array of some coefficients. Memory consumption of the latest version is the smallest. Maybe there is a faster solution with higher memory consumption? My program calls this code 2500 times, so that adds up to 250 s. I hope we can make the code faster.
– N0rbert, Sep 6 '12 at 15:18

You'll see that all of the pain comes from the tangent function, so you had best focus on that. As you can read here, speeding up trig functions is basically only possible if you sacrifice some accuracy.

In all you should ask yourself these questions:

Are you willing to give up full double precision, or are, say, 6 digits "close enough"?

Can't you re-formulate the problem so that the tangent is computed afterwards? Or before? Or, in any case, on a significantly smaller number of elements? In the problem as stated this is obviously not possible, but I don't know the full code; there might be some nice trig identities that apply to your problem.

Given all of the above, does the effort needed to optimize this further really outweigh the longer runtime? 250 seconds does not sound too bad compared to writing custom, hardly portable MEX functions that implement crippled-but-fast trig functions.
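To illustrate the first question, here is a minimal sketch of trading double precision for speed by evaluating tan() in single precision. The matrix A and its size are my own stand-ins for the real k.*z data, not taken from the original code:

```matlab
% Hypothetical sketch: evaluate tan() in single precision to trade
% accuracy for speed. A is a stand-in for the k.*z matrix.
A  = rand(2000);                 % example data, not the real k.*z
Ts = tan(single(A));             % single-precision tan
Td = tan(A);                     % full double-precision reference
max_err = max(abs(double(Ts(:)) - Td(:)));  % check you still have ~6-7 digits
```

Whether single precision is actually faster depends on the MATLAB version and CPU, so measure with tic/toc before committing to it.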

Good way to show the culprit here (tan)! You get one upvote from me :) Are you positive MATLAB will automatically multithread trig operations, even without the Parallel Computing Toolbox installed?
– zeFrenchy, Sep 6 '12 at 21:16

@DominiqueJacquel It was a new feature in R2007a (see here for an example). You can see this in Task Manager/top if you run the calculation in an infinite while loop; you'll see that MATLAB is using most or all cores.
– Rody Oldenhuis, Sep 7 '12 at 2:52

@Rody Oldenhuis Thank you for your complete answer. Sometimes I think about computing such tasks in pure Fortran/C or using GotoBLAS/OpenBLAS; they are faster in execution time, but development of such code is slower.
– N0rbert, Sep 7 '12 at 9:32

Essentially, instead of forming the outer product of k and z, I operate directly on matrices.

First version

Elapsed time is 0.652923 seconds.
Elapsed time is 0.240300 seconds.

After Dominique Jacquel's answer

Elapsed time is 0.376218 seconds.
Elapsed time is 0.214047 seconds.

My version

Elapsed time is 0.168535 seconds.

You may have to add the cost of the repmats, but maybe you can do that once only; I don't know the rest of the code.
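The idea can be sketched as follows. The shapes of k and z are a guess (column and row vector), since the full code is not shown:

```matlab
% Sketch under assumed shapes: k a column vector, z a row vector.
k = rand(1000, 1);
z = rand(1, 1000);

% Outer-product formulation (first version):
T1 = tan(k * z);

% Operating directly on matrices: expand once with repmat,
% then use element-wise multiplication.
K  = repmat(k, 1, numel(z));     % cost you may be able to pay once
Z  = repmat(z, numel(k), 1);
T2 = tan(K .* Z);                % tan(k(i)*z(j)) without the k*z product
```

norm(T1 - T2) should be zero here, since both reduce to a single multiply per element; the gain comes from reusing K and Z across the 2500 calls instead of recomputing the outer product each time.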

I fully agree with Rody Oldenhuis. The majority of the work lies in the tangent function. I can tell you more: the k.*z computation is very efficient and cannot be improved much. If you calculate the memory bandwidth, it reaches around 10 GB/s on my computer. The peak I can get is around 16 GB/s, so it's close. Not many possibilities there. Same with C*T: that is a simple BLAS2 matrix-vector multiplication, which is memory bound. For the system sizes you are showing, the MATLAB overhead is not too big.

Edit: as Rody mentioned, new versions of MATLAB already do parallelize tan(). So not much here either.

You can only hope to improve the tan() itself, possibly by running it in parallel. After all, it is a trivially parallelizable task. Consider exporting just this part to a MEX file that uses OpenMP. Very simple work, and lots of speedup if you have a few spare cores.
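If writing a MEX file is too much hassle, a parfor loop over columns is a simpler (if less efficient) way to parallelize the tangent. This sketch assumes the Parallel Computing Toolbox is available, and A stands in for the k.*z matrix:

```matlab
% Illustrative alternative to a MEX/OpenMP implementation:
% parallelize tan() over columns with parfor.
A = rand(2000);                  % stand-in for the k.*z matrix
T = zeros(size(A));
parfor j = 1:size(A, 2)
    T(:, j) = tan(A(:, j));      % each column handled by a worker
end
```

Given that recent MATLAB versions already multithread tan(), this only pays off on versions that do not.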

Thank you. What is interesting is that new versions of MATLAB (tested on R2008b) no longer support the mcc -x flag: "Error: -x is no longer supported. The MATLAB Compiler no longer generates MEX files because there is no longer any performance advantage to doing so: the MATLAB JIT accelerates M-files by default. To hide proprietary algorithms, use the PCODE function." Maybe I'll try to call an external self-made tan function.
– N0rbert, Sep 7 '12 at 9:36

I meant that you should write your own MEX function and compile it using mex. This way you could skip the entire repmat / k*z business, which only eats up memory bandwidth and hurts execution time. k*z is a BLAS2 operation; you do not need to explicitly create the matrix, since you can compute its entries on the fly.
– angainor, Sep 7 '12 at 9:45