All CPUs from the Pentium 4 onward (newer than Nov. 2000) support the SSE2 register model. Simply adding the SSE2 target option to the builds would require the machines to have been made this century, but the code would then use the SSE registers. The directly addressable XMM registers (8 in 32-bit mode, 16 in 64-bit mode) would reduce register stores to the stack and ease code scheduling (less shuffling of data around and more computation).

A simple recompile should make a noticeable difference without any side effects. If you compile for anything newer than SSE2, or for GPUs, you have to start worrying about and managing the population of target machines you deliver workloads to.
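One common way to manage a mixed population of machines is to ship a conservative baseline and choose a faster build at run time. Here is a minimal sketch of that idea, assuming GCC or Clang on x86 (it relies on the `__builtin_cpu_supports` builtin). The build names returned by `pick_build` are hypothetical labels for illustration, not the project's actual binaries.

```c
/* Returns 1 if the CPU reports SSE2, 0 if it does not, and -1 if this
   compiler/architecture gives us no way to ask (assumption: only
   GCC/Clang on x86 are handled in this sketch). */
int cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return -1;
#endif
}

/* Same idea for SSE3. */
int cpu_has_sse3(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse3") ? 1 : 0;
#else
    return -1;
#endif
}

/* Picks the most capable build the machine can run. The labels are
   hypothetical; anything other than a definite "yes" (1) falls back
   to the generic build. */
const char *pick_build(int has_sse2, int has_sse3)
{
    if (has_sse2 == 1 && has_sse3 == 1)
        return "sse3";
    if (has_sse2 == 1)
        return "sse2";
    return "generic";
}
```

A launcher could call `pick_build(cpu_has_sse2(), cpu_has_sse3())` once at startup. Making SSE2 the baseline removes this bookkeeping for everything except the newer instruction sets.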

I just updated the minirosetta_beta app. I did not include SSE Linux builds, since they will require more testing. I did turn on SSE for the Windows build. The latest Linux SSE3 test was causing a significant number of failures.

We don't have much time or many resources to test these optimizations, and it would be great if any of you would like to volunteer to help. As stated before, we can provide the source, build instructions, and tests. If you are interested, please contact me directly at dekim AT u.washington.edu

rjs5, over at this thread in R@H, seemed very willing to help. I don't know if you guys have talked about the source code, et cetera, via inbox.

SSE was introduced in 1999 with the Pentium-3 CPU, and SSE2 was introduced in late 2000 with the Pentium-4 CPU and only extended SSE. Anything that works under SSE will also work on an SSE2 CPU; the only CPUs that have SSE but not SSE2 are Pentium-3-era parts.

The project will get more work done by sacrificing the Pentium-3 cycles (making SSE2 the minimum) and optimizing for SSE2+.

Once you get to SSE2, you will only get minor improvements, probably just a couple of percent, by going to the trouble of pushing the SSE/AVX envelope.

Since R@H is compiled and runs in SCALAR mode, which crunches only one 64-bit value in each 128-bit XMM register (which can hold two), there is much more to gain by closely examining the source code and understanding what is preventing the compilers from VECTORIZING it. If you can use BOTH 64-bit fields in the XMM registers, you get up to a 2x performance increase. With wider vectors you crunch 2, 4, 8, ... floating-point values in the same time as 1.
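To make the scalar-versus-packed distinction concrete, here is a small sketch using the SSE2 intrinsics from `<emmintrin.h>`. It is not taken from the Rosetta source; the function names are made up for illustration, and it assumes a compiler that defines `__SSE2__` when SSE2 code generation is enabled (GCC and Clang do). Without that macro, the packed function simply falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_add_pd, ... */
#endif

/* Scalar form: as written, one double is added per iteration.
   (An optimizing compiler may still vectorize a loop this simple.) */
void add_scalar(const double *a, const double *b, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}

/* Packed form: both 64-bit lanes of an XMM register are used, so each
   _mm_add_pd adds two doubles at once. */
void add_packed(const double *a, const double *b, double *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);   /* unaligned load of 2 doubles */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(out + i, _mm_add_pd(va, vb));
    }
#endif
    /* Leftover element when n is odd, or everything when SSE2 is off. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

Both functions produce the same results. The difference is only in how many values each instruction handles, which is where the "up to 2x" for doubles comes from.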

This is also the reason that there is no GPU version, and there can NEVER be a GPU version until this is fixed ... IF the source can be changed to vectorize at all.

Starting from a generic, crappy 32-bit i386 version:
You get 80% of the available scalar performance gain just by generating a 64-bit version.
You get the other 20% of the scalar gain by messing with compiler options ... but at a high portability cost.

The next barrier after a 64-bit version should be SSE2.
The next barrier after 64-bit and SSE2 is VECTORIZATION ... NOT SSE3, SSE4, AVX, ...
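For readers wondering what "preventing the compilers from vectorizing" can look like in practice, here are two generic textbook patterns in C. They are not drawn from the Rosetta source, and whether either applies there is unknown; the function names are illustrative only.

```c
#include <stddef.h>

/* Independent iterations. `restrict` promises the compiler that the
   arrays do not overlap, so it is free to compute several i at once.
   Without that promise it must allow for `out` aliasing `x` or `y`. */
void scale_add(double *restrict out, const double *restrict x,
               const double *restrict y, double s, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = s * x[i] + y[i];
}

/* Loop-carried dependence. Each result needs the previous one, so the
   iterations cannot simply be executed side by side in vector lanes. */
void running_sum(double *out, const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += x[i];
        out[i] = acc;
    }
}
```

The first loop is the kind compilers vectorize readily. The second usually needs an algorithmic change before wider registers help at all.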

What is the gain in going native 64-bit? I would've thought that going SSE2 would bring a higher gain than 64-bit (I've always associated 64-bit with better memory addressing rather than increased computation speed).

All x86_64 CPUs have at least SSE2. My first sentence above does not make much sense, since 64-bit CPUs already have the SSE2 registers.

64-bit mode has 16 general-purpose registers rather than the 8 of the 386. There is a substantial reduction in temporary register spills and fills to/from the stack. When you eliminate the traffic to store/restore data in stack variables, you reduce cycles per instruction. Saving a register to a temporary stack variable requires the WRITE to be pushed out to the L2 cache, which typically takes 5 to 10 cycles, at least on CPUs whose L1 caches are write-through.
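A quick way to check what a given build actually is (native 64-bit or not, SSE2 code generation on or off) is to ask the compiler's predefined macros. This sketch assumes the usual ones: `__SSE2__` on GCC/Clang, and `_M_X64` / `_M_IX86_FP` on MSVC. The function names are made up for illustration.

```c
#include <limits.h>  /* CHAR_BIT */

/* Pointer width of this build in bits: 64 for a native 64-bit build,
   32 for a 32-bit build (even when run on a 64-bit OS). */
int build_pointer_bits(void)
{
    return (int)(sizeof(void *) * CHAR_BIT);
}

/* 1 if the compiler was allowed to emit SSE2 instructions for this
   build, 0 otherwise. On x86_64 targets this is always 1. */
int build_uses_sse2(void)
{
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return 1;
#else
    return 0;
#endif
}
```

Printing both values from a test build would show whether it is a true 64-bit binary and whether the SSE2 target option took effect.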

I don't understand.
Is the ralph/rosetta 64-bit app we are currently using not "native"?

It seems only the Linux version is 64-bit. The Windows version is still a 32-bit build running with a 64-bit wrapper.

Some news. I was granted a source license and I have started wading in. The documentation is dated and inaccurate (as always). I am looking to hook up with a developer to focus on the configuration they build for this project and feed back findings.
