I think after looking at the data so far, we can conclude the following: Cortex A9 implementations such as Exynos 4 and OMAP4 are achieving about 0.4 flops/cycle in multithreaded mode. Snapdragon S3 is also doing about 0.4 flops/cycle.

However, Tegra 3 appears to be the exception to the rule, achieving only about 0.32 flops/cycle on average. Tegra 2 is also stuck at about 0.36 flops/cycle. I wonder if it has something to do with Nvidia's memory controller.
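
For anyone who wants to sanity-check these figures: flops/cycle is just the measured MFLOPS divided by the cycles delivered per microsecond, i.e. MFLOPS / (threads × MHz) for a per-core figure. A minimal sketch in C; every number in it is a made-up placeholder, not anyone's actual result:

[code]
#include <stdio.h>

/* flops/cycle = measured MFLOPS / (threads * clock in MHz).
 * All numbers below are hypothetical placeholders, not real results. */
int main(void)
{
    double mflops  = 1900.0;   /* hypothetical 4-thread result */
    int    threads = 4;        /* hypothetical quad-core A9 */
    double mhz     = 1200.0;   /* hypothetical per-core clock */

    double flops_per_cycle = mflops / threads / mhz;
    printf("%.2f flops/cycle per core\n", flops_per_cycle);
    return 0;
}
[/code]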

About the single-threaded results: I had actually forgotten that the single-threaded and multithreaded versions work on different problem sizes. The single-threaded one works on smaller matrices. This was to ensure that the single-threaded case does not take very long to run, but I am now starting to think that was a poor design decision on my part. So the single-threaded and multithreaded results are not directly comparable. I will probably add a setting to let you choose the matrix sizes yourself at some point, roughly along the lines of the sketch below.
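
This is only a sketch of what I have in mind, and the names and defaults are placeholders, but the idea is simply a knob for N so that single- and multi-threaded runs can use the same problem size:

[code]
#include <stdio.h>
#include <stdlib.h>

/* Sketch: let the user pick N so single- and multi-threaded runs
 * can use the same problem size. Names and defaults are placeholders. */
int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 512;   /* e.g. ./rgbenchmm 1024 */
    if (n <= 0) {
        fprintf(stderr, "usage: %s [matrix-size]\n", argv[0]);
        return 1;
    }
    printf("benchmarking %d x %d matrix multiplication\n", n, n);
    /* ... allocate matrices and run the timed kernel here ... */
    return 0;
}
[/code]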

Actually, I think this might have something to do with my undervolted kernel. Will flash Jelly Bean tomorrow and rerun this. I just reran it on ova with UV: 1 thread 458; 4 threads 1961, 1774 mflops. Kinda weird.

Thanks. Different firmware versions can indeed affect thread scheduling and frequency scaling, so that might be the issue. Here is my blog post with a preliminary analysis of the data, including the assembly generated by GCC for the innermost loop; more technical readers may find it interesting: http://codedivine.org/2012/09/25/prelim ... rgbenchmm/

Interesting. I was not aware of that test. The MFLOPS numbers still do not reflect what should be possible on these processors. LINPACK is a somewhat misleading benchmark name: Linpack tests are not really a single benchmark. Rather, the task is more of a "calculate this using whatever algorithm you feel is appropriate, as long as the results are accurate and stable". Most Linpack implementations run on servers are rewritten to make heavy use of the BLAS, and depend so heavily on good matrix multiplication performance that Linpack results come close to the flops of a matrix multiplication benchmark (such as mine).
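
To make that concrete: multiplying two N x N matrices performs 2*N^3 floating-point operations (one multiply and one add per innermost iteration), so the FLOP rating falls straight out of the measured time. A bare-bones sketch of how such a benchmark counts flops; this is only an illustration, not the actual kernel from my benchmark:

[code]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 500   /* arbitrary illustration size */

/* Naive triple loop: 2*N^3 flops in total
 * (one multiply and one add per innermost iteration). */
static void matmul(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i*N + k] * b[k*N + j];
            c[i*N + j] = sum;
        }
}

int main(void)
{
    float *a = malloc(sizeof *a * N * N);
    float *b = malloc(sizeof *b * N * N);
    float *c = malloc(sizeof *c * N * N);
    if (!a || !b || !c) return 1;
    for (int i = 0; i < N * N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* clock() measures CPU time; for a single thread that is
     * close enough to wall-clock time for this illustration. */
    clock_t t0 = clock();
    matmul(a, b, c);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    double mflops = 2.0 * (double)N * N * N / secs / 1e6;
    printf("%.1f MFLOPS (%.3f s)\n", mflops, secs);

    free(a); free(b); free(c);
    return 0;
}
[/code]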

However, reference Linpack implementations in C/C++ also exist. These are meant more for reference/learning; they do not use the BLAS and are not used to test performance on servers. Given the relatively low FLOP rating you got, I suspect what they compiled is this reference version.

edit: However, it should still be a good benchmark and hopefully not as terrible as the Java version.

killadark wrote: Actually, I think this might have something to do with my undervolted kernel. Will flash Jelly Bean tomorrow and rerun this. I just reran it on ova with UV: 1 thread 458; 4 threads 1961, 1774 mflops. Kinda weird.

I'm getting the same result after flashing Jelly Bean, so I think it's something to do with the benchmark.
