The online https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for VPMASKMOV says that "mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated." But the documentation in the Intel Instruction Set Reference Guide does not mention an alignment requirement, and seems to imply that it is not required: "Faults occur only due to mask-bit required memory accesses that caused the faults.".

It is very nice to have this forum. I'm a fresh on the ISA Extension and expect to have your insight:)

My code snippet, which conducts a convolution computing, is attached as a figure. and here is my confusing issue:

Time was consumed hugely when I tried to assign the computed result to image buffer. Computing time of extension sets(line 512~544) only takes about 7~8ms, but the assign work(line 548) takes about 25~26ms.

I'm running an experiment on a server machine with a quad-core Xeon X5355 processor running a linux system.

I try to control core voltage and frequency separately by writing to the msr IA32_PERF_CTL (0x199). I change the value of IA32_PERF_CTL using a "wtmsr" command and verify that its value has been changed using a "rdmsr" command. However, when I run "rdmsr 0x199" again a few seconds later, I find that the value of IA32_PERF_CTL is overwritten with its previous value. The value of IA32_PERF_STATUS does not represent my change either.

while I was trying to solve some particular multi-thread problem, it occurred to me that it could be solved more efficiently with special assistance from the CPU.

The situation is as follows: say one thread needs to block until the content of a particular 4-byte (or can be other size) location in the memory is changed. (It think the usefulness of this is very obvious and there is no need to give concrete examples to demonstrate it).

I see that the latency for a vpgatherdd is 18 clocks. I am thinking that other load type instructions can execute while the vpgatherdd is still working, since it only performs 8 loads while the processor can issue one load per clock. Is that correct?

When Im processing 256-bits of data, is better to use (in one core) for this one whole YMMx register or to split them for 2x128-bits and process them through 2 XMMx registers at different ports, hence on different SSE/AVX unit (in Sandy Bridge there are 3 ports per core for AVX)? Which option is faster?