
I'm writing a simple software renderer and trying to use SSE to accelerate it.

How would one write 4 pixels into the colorbuffer without first reading the previous contents of the framebuffer?

Is there a magical instruction for selectively writing the words of an SSE register, based on a mask?

I only know of the _mm_store_si128() intrinsic, but it writes the whole register to memory,
so I need to fetch the old 4 colors, combine them with the computed colors using a bit mask, and write them back.
I'd like to avoid reading the old pixels.


In AVX, there's the _mm_maskstore_ps/vmaskmovps instruction. In SSE2, there's the _mm_maskmoveu_si128/maskmovdqu instruction, but note that it belongs to the class of byte-wide integer instructions, so it can cause a few cycles of pipeline stall if the surrounding code is floating-point and a transition from float mode to int mode occurs. Profile to see whether that matters in your case.
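As a rough illustration of the SSE2 option, here is a minimal sketch of a masked 4-pixel write built on _mm_maskmoveu_si128. The helper name write_masked_4px and the array-based interface are assumptions made for the example, not something from the original post; it also assumes 32-bit pixels and a mask whose lanes are either all-ones or all-zeros.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: writes only those of the 4 pixels whose mask lane
   is 0xFFFFFFFF. Pixels with a zero mask lane are left untouched, and the
   old framebuffer contents are never loaded by this code. */
static void write_masked_4px(uint32_t *dst,
                             const uint32_t *src,
                             const uint32_t *lane_mask)
{
    __m128i colors = _mm_loadu_si128((const __m128i *)src);
    __m128i mask   = _mm_loadu_si128((const __m128i *)lane_mask);

    /* maskmovdqu stores each byte of `colors` whose corresponding mask
       byte has its most significant bit set; dst may be unaligned. */
    _mm_maskmoveu_si128(colors, mask, (char *)dst);
}
```

In a real rasterizer the colors and the coverage mask would normally already be in registers, so only the final intrinsic would be needed. Whether this beats load-blend-store depends on the CPU, so it is worth measuring both.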

If you are doing a manual load-blend-store, SSE4.1 has the _mm_blend_ps/blendps and _mm_blendv_ps/blendvps instructions, which can help, although a load followed by a dependent store like that can have a large performance impact. Before SSE4.1, the same blend between registers can be achieved with a sequence of and+andnot+or instructions.
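The pre-SSE4.1 and+andnot+or sequence can be sketched roughly as follows. The function names select_si128 and blend_store_4px are made up for this example, and it assumes each 32-bit mask lane is either all-ones (take the new pixel) or all-zeros (keep the old one).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Per-bit select: where mask is 1 take new_px, where it is 0 keep old_px. */
static inline __m128i select_si128(__m128i old_px, __m128i new_px, __m128i mask)
{
    /* _mm_andnot_si128(a, b) computes (~a) & b */
    return _mm_or_si128(_mm_and_si128(mask, new_px),
                        _mm_andnot_si128(mask, old_px));
}

/* Hypothetical helper: load 4 old pixels, blend in the new ones, store back. */
static void blend_store_4px(uint32_t *dst,
                            const uint32_t *src,
                            const uint32_t *lane_mask)
{
    __m128i old_px = _mm_loadu_si128((const __m128i *)dst);
    __m128i new_px = _mm_loadu_si128((const __m128i *)src);
    __m128i mask   = _mm_loadu_si128((const __m128i *)lane_mask);

    _mm_storeu_si128((__m128i *)dst, select_si128(old_px, new_px, mask));
}
```

The unaligned loads and stores are used only so the sketch makes no alignment assumptions; with a 16-byte-aligned framebuffer the aligned variants would do.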


You should not worry about the loading/storing and masking. In the end, the memory controller loads a whole cache line into L2 and L1, and from there it doesn't really matter whether you load/store 1 byte or 32 bytes. Modern CPUs (Sandy Bridge, Ivy Bridge) can load two 16-byte words per cycle, and most older ones can still load 16 bytes per cycle. Internally it's impossible to address just one byte in memory anyway: the load/store unit has to fetch the surrounding data, modify the bytes you wrote, and store it back together with the whole bunch of data you did not modify.

To get the best performance, focus on using as few instructions as possible when you have data dependencies like the ones in your code, e.g.
const __m128i oldQuad = _mm_load_si128( dest );
__m128i result = _mm_or_si128( oldQuad, _mm_and_si128( _mm_set1_epi32(RGBA8_WHITE), qiCXmask ) ); // potential stall: waits for the load and the AND
_mm_store_si128( dest, result ); // potential stall: waits for the OR

You have two potential stalls here. If the out-of-order units cannot find other independent instructions, the OR will wait until the LOAD and the AND are done, and the STORE will in turn wait for the OR. In the end this might cost a lot.

So I would suggest that a plain load, blend, store is probably the friendliest sequence for your pipeline. If you build for 64-bit, unroll your loop to process 4 lines at the same time (just as you now process 4 pixels in a line), so that you work on 16 pixels per iteration. The compiler will do a good job of utilizing all 16 SSE registers, and you'll probably end up with fewer cycles per pixel.
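A rough sketch of that unrolling, under some assumptions not in the post: the function name blend_block_4x4 is invented, the framebuffer pitch is given in pixels, the 16 new colors and 16 mask lanes are stored contiguously row by row, and each mask lane is all-ones or all-zeros. The point is only that the four rows form independent load, blend, store chains that the CPU can overlap.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Per-bit select: mask bit 1 takes new_px, mask bit 0 keeps old_px. */
static inline __m128i select_px(__m128i old_px, __m128i new_px, __m128i mask)
{
    return _mm_or_si128(_mm_and_si128(mask, new_px),
                        _mm_andnot_si128(mask, old_px));
}

/* Hypothetical helper: blends a 4x4 pixel block (4 pixels on each of 4 rows).
   pitch_px is the distance between framebuffer rows, in pixels. */
static void blend_block_4x4(uint32_t *dst, size_t pitch_px,
                            const uint32_t *src, const uint32_t *mask)
{
    /* Issue all four loads up front; none depends on another. */
    __m128i old0 = _mm_loadu_si128((const __m128i *)(dst + 0 * pitch_px));
    __m128i old1 = _mm_loadu_si128((const __m128i *)(dst + 1 * pitch_px));
    __m128i old2 = _mm_loadu_si128((const __m128i *)(dst + 2 * pitch_px));
    __m128i old3 = _mm_loadu_si128((const __m128i *)(dst + 3 * pitch_px));

    /* Four independent blends, one per row. */
    __m128i r0 = select_px(old0, _mm_loadu_si128((const __m128i *)(src + 0)),
                           _mm_loadu_si128((const __m128i *)(mask + 0)));
    __m128i r1 = select_px(old1, _mm_loadu_si128((const __m128i *)(src + 4)),
                           _mm_loadu_si128((const __m128i *)(mask + 4)));
    __m128i r2 = select_px(old2, _mm_loadu_si128((const __m128i *)(src + 8)),
                           _mm_loadu_si128((const __m128i *)(mask + 8)));
    __m128i r3 = select_px(old3, _mm_loadu_si128((const __m128i *)(src + 12)),
                           _mm_loadu_si128((const __m128i *)(mask + 12)));

    _mm_storeu_si128((__m128i *)(dst + 0 * pitch_px), r0);
    _mm_storeu_si128((__m128i *)(dst + 1 * pitch_px), r1);
    _mm_storeu_si128((__m128i *)(dst + 2 * pitch_px), r2);
    _mm_storeu_si128((__m128i *)(dst + 3 * pitch_px), r3);
}
```

Whether this actually lowers cycles per pixel depends on the compiler and the CPU, so it should be profiled against the one-row version.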