Hi all,
I have just added an application for Linux ARM v7a vfpv4 (the address is the same: https://bitbucket.org/sirzooro/pc-boinc/downloads). It is about 30% faster than the original one. I tested it on my Odroid XU4 and it runs fine. Unfortunately, NEON instructions on ARMv7 do not support double-precision operations, so further optimization through vectorization is not possible for this ARM build. Maybe some future generations of ARM CPUs will allow this.
____________

Hello,
Sorry for the question, but I see no performance difference between AVX and FMA on an i7 4770K (HT off, Windows 7 Ultimate). Do you have any recommendation, or is this normal?
(PrimeGrid LLR WUs use FMA3, I think, and that makes the CPU run at its highest performance.)
Thank You.
Philippe

At the moment I do not have BOINC libs compiled for 32-bit Linux or for Windows. I will prepare the Windows app later. I am not sure whether anyone still needs a 32-bit app for Windows or Linux, so let me know if you need one.

Yes, I have 5 hosts running 32-bit Windows.
If you can create a 32-bit app, that would be very cool.
Thanks for your optimization work.

I have added 32-bit apps for Windows in four versions: without SIMD instructions (x87 FPU version), SSE2, AVX and FMA. They passed my small test, so they should give correct results. Let me know if they work for you.
____________

No need for an answer. I saw your benchmark tests.

You should see some difference, as in the benchmark results above. Did you run the different app versions manually on test data, or did you run them on BOINC tasks? If the latter, please keep in mind that tasks vary in length: some of them complete faster, some slower.

I just installed the optimized version on my 32-bit Windows hosts and everything seems to work perfectly.
Thanks again for your work ^_^

I found something that may be a problem. You use an undirected graph, so I thought that I could reduce the number of iterations of the loop at pc.cpp:418 by testing (i,j) pairs for j > i only. However, after doing this the output file size changed from 47.8K to 67.6K. The original code, before my changes, also generated a bigger file after applying this change. I checked the code briefly and do not see anything obvious that may cause this. Could you take a look at this?

It is actually not possible to halve the number of iterations, because the algorithm chooses a pair of nodes i,j and tests whether the edge linking the two nodes should be removed. When l increases, the test is conditioned on a set of neighbours of size l of the first node. If the edge is not removed, it could still be the case that there exists a set of neighbours of j of size l that allows the removal of the edge. So it is important to test every ordered combination of i,j, that is, both (i,j) and (j,i).
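
The point above can be illustrated with a small, self-contained sketch. This is not the code from pc.cpp: the names `Graph`, `IndepTest`, `prune_level` and `edges_left_after_level1` are hypothetical, and the independence test is a hand-written toy oracle. It only shows why visiting j > i alone can leave an edge in place that the full ordered loop would remove.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Hypothetical types for illustration only (not taken from pc.cpp).
using Graph = std::vector<std::set<int>>;  // undirected adjacency sets
using IndepTest = std::function<bool(int, int, const std::vector<int>&)>;

// One pass of a PC-style edge pruning for conditioning sets of size l.
// Conditioning sets are drawn from the neighbours of the FIRST node only,
// so (i,j) and (j,i) test different sets.
int prune_level(Graph& g, std::size_t l, const IndepTest& indep,
                bool all_ordered_pairs) {
    int removed = 0;
    const int n = static_cast<int>(g.size());
    for (int i = 0; i < n; ++i) {
        for (int j = all_ordered_pairs ? 0 : i + 1; j < n; ++j) {
            if (i == j || g[i].count(j) == 0) continue;
            assert(g[j].count(i) == 1 && "graph must be symmetric");
            // Candidate conditioning nodes: neighbours of i, excluding j.
            std::vector<int> cand;
            for (int k : g[i]) {
                if (k != j) cand.push_back(k);
            }
            if (cand.size() < l) continue;
            // Enumerate all subsets of size l with a 0/1 selector vector.
            std::vector<int> sel(cand.size(), 0);
            std::fill(sel.begin(), sel.begin() + static_cast<std::ptrdiff_t>(l), 1);
            do {
                std::vector<int> s;
                for (std::size_t k = 0; k < cand.size(); ++k) {
                    if (sel[k]) s.push_back(cand[k]);
                }
                if (indep(i, j, s)) {
                    g[i].erase(j);
                    g[j].erase(i);
                    ++removed;
                    break;
                }
            } while (std::prev_permutation(sel.begin(), sel.end()));
        }
    }
    return removed;
}

// Toy chain 0 - 1 - 2 with an oracle that declares 0 and 1 independent
// only given {2}. Node 2 is a neighbour of 1 but not of 0.
int edges_left_after_level1(bool all_ordered_pairs) {
    Graph g = {std::set<int>{1}, std::set<int>{0, 2}, std::set<int>{1}};
    IndepTest oracle = [](int a, int b, const std::vector<int>& s) {
        const bool pair01 = (a == 0 && b == 1) || (a == 1 && b == 0);
        return pair01 && s.size() == 1 && s[0] == 2;
    };
    prune_level(g, 1, oracle, all_ordered_pairs);
    int degree_sum = 0;
    for (const auto& nb : g) degree_sum += static_cast<int>(nb.size());
    return degree_sum / 2;
}
```

In this toy case the separating set {2} is only reachable when node 1 is the first node of the pair, so the pair (1,0) removes the edge while (0,1) cannot. With j > i only, the edge survives, which would be consistent with the larger output file reported above if the output size grows with the number of remaining edges.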

I'm just thinking about strategies for deploying the new versions of the application. Some thoughts:
- SSE2 should be the base version (I guess that there are no computers without SSE2 around any more)
- AVX is okay; I don't know what to do with the FMA version
- we will have versions for Win x32-x64 and Linux x32-x64; we are still missing a version for Mac-OS x64
- ARM. I'd like to support it in a standard way, but I don't know which platform is the most suitable (see here: https://boinc.berkeley.edu/trac/wiki/BoincPlatforms) or whether an app plan is needed (see https://boinc.berkeley.edu/trac/wiki/AppPlan)

Good news :) A few comments on this:
- stats on the downloads page show that the 32-bit Windows non-SSE version of my app was downloaded 12 times, so there is some need for it. You can also decide to provide this version later if someone asks for it;
- FMA should be OK too. It should be sent to hosts which support the FMA3 instruction set;
- I am not sure if there is a cross-compiler ready. If Mac header files are available somewhere, you can try to build a cross-compiler (the crosstool package will be your friend);
- you need at least two platforms, arm-unknown-linux-gnueabihf and aarch64-unknown-linux-gnu (for 32- and 64-bit ARM). There are 3 versions of the 32-bit ARM app, so plan classes will also be needed. The supported FPU instruction set should be sent to the server in a similar way as for x86 CPUs;
- some projects try to send a few app versions to the client to gather some benchmarks and choose the fastest one. This is probably a standard BOINC server feature. It would be good to use it here, to check whether the AVX app is faster than SSE2; people reported mixed results for these apps. The FMA app was always faster than AVX, but it may be worthwhile to benchmark it against SSE2.
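
To make the idea of sending each build only to hosts that can run it more concrete, here is a minimal sketch. It is not BOINC's plan class mechanism: the function name `pick_variant`, the space-separated feature string and the variant labels are all assumptions made for illustration.

```cpp
#include <set>
#include <sstream>
#include <string>

// Picks the most specialised build a host can run, given a space-separated
// list of CPU feature flags. The flag names ("fma", "avx", "sse2") and the
// returned labels are assumed for this example.
std::string pick_variant(const std::string& cpu_features) {
    std::istringstream in(cpu_features);
    std::set<std::string> have;
    for (std::string flag; in >> flag;) have.insert(flag);

    // Assumption: the FMA build also uses AVX, so it requires both flags.
    if (have.count("fma") && have.count("avx")) return "fma";
    if (have.count("avx")) return "avx";
    if (have.count("sse2")) return "sse2";
    return "x87";  // plain FPU build without SIMD
}
```

For example, `pick_variant("fpu sse sse2 avx")` returns `"avx"`, and a host that reports none of the three flags falls back to `"x87"`. A real deployment would also have to check that the operating system supports AVX, which this sketch ignores.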
____________

OK. I just added the new SSE2 versions for Windows/Linux x64, and the normal + SSE2 versions for Win32. Let's see if they work correctly before adding the other ones.

[addendum] I found a comment on boinc_dev saying that:

> Any processor with avx will also have pni, so you should expect both apps
> to go to machines with AVX until the server can figure out which one is
> faster on a given host (which is usually about 10 results if there's a
> significant speed difference). If there is no speed difference, then both
> will be sent for a long time.

So, with sse2, avx, fma, any modern computer will get the three applications and eventually decide which one is the best one...
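
As a rough illustration of "eventually decide which one is the best one", the sketch below picks the version with the lowest mean runtime once every version has enough results. This is not the BOINC server's actual logic; `fastest_version` and its parameters are hypothetical, and the threshold of about 10 results comes only from the comment quoted above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Returns the name of the app version with the lowest mean runtime, or an
// empty string while any version still has fewer than min_results samples.
// Hypothetical helper for illustration only.
std::string fastest_version(
    const std::map<std::string, std::vector<double>>& runtimes,
    std::size_t min_results) {
    std::string best;
    double best_mean = 0.0;
    for (const auto& entry : runtimes) {
        const std::vector<double>& samples = entry.second;
        // Not enough data yet: keep sending every version.
        if (samples.empty() || samples.size() < min_results) return "";
        double sum = 0.0;
        for (double s : samples) sum += s;
        const double mean = sum / static_cast<double>(samples.size());
        if (best.empty() || mean < best_mean) {
            best = entry.first;
            best_mean = mean;
        }
    }
    return best;
}
```

Because tasks in this project vary in length, a plain mean of runtimes can be misleading; a fairer comparison would normalise each runtime by the size of the task, which this sketch does not attempt.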

Potential problem: on some machines the WUs error out with avx and fma, while sse2 seems to work with everything I've tried.
I've found that fma can be slightly faster than the sse2 version on some machines, but the difference is small.
