*_RING are macros, so they are already inlined. You also left the OUT_RING and ADVANCE_RING macros in place, so the ring may be written twice (with strange side effects, since the tail pointer may be changed twice).
Furthermore I fail to parse the

Code:

if (write < ...

statements; typo?

Comment

The macros implement a write, an increment of an index, and a mask operation. My patch checks whether the mask operation is needed and, if not, executes just the writes. This leads to code that takes 1/4 of the time to execute, because the CPU can execute both writes in parallel. This gives you two writes per CPU cycle.

With the macro, the write and the index increment can be done in parallel. The mask operation cannot be done in parallel with the next write because of a data dependency, so you get one write every two CPU cycles.
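
To make the two paths concrete, here is a minimal sketch of the idea, not the actual patch. The names (struct ring, ring_emit2_masked, ring_emit2_fast) and the 8-word ring size are made up for illustration; the real macros work on the driver's own ring state.

```c
#include <assert.h>
#include <stdint.h>

#define RING_WORDS 8u                  /* hypothetical size, must be a power of two */
#define RING_MASK  (RING_WORDS - 1u)

struct ring {
    uint32_t buf[RING_WORDS];
    uint32_t tail;                     /* next write index, always < RING_WORDS */
};

/* Macro-style path: write, increment, mask for every single word.
 * Each store needs the masked index produced by the previous step. */
static void ring_emit2_masked(struct ring *r, uint32_t a, uint32_t b)
{
    r->buf[r->tail] = a;
    r->tail = (r->tail + 1u) & RING_MASK;
    r->buf[r->tail] = b;
    r->tail = (r->tail + 1u) & RING_MASK;
}

/* Fast path: one wrap check up front, then stores at constant offsets
 * from a base pointer, so the stores do not depend on each other. */
static void ring_emit2_fast(struct ring *r, uint32_t a, uint32_t b)
{
    assert(r->tail < RING_WORDS);
    if (r->tail + 2u <= RING_WORDS) {
        uint32_t *p = &r->buf[r->tail];
        p[0] = a;
        p[1] = b;
        r->tail = (r->tail + 2u) & RING_MASK;  /* single mask for the whole batch */
        return;                                /* early return: ring is not written twice */
    }
    ring_emit2_masked(r, a, b);                /* batch would wrap: use the slow path */
}
```

Both functions leave the ring in the same state; only the dependency chain between the stores differs. Whether that yields the claimed speedup depends on the CPU and on what the compiler emits.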

Comment

Just to be clear, it isn't your patch that's causing this, because I haven't tried it yet. Just by the name of it, it looks like some kind of OpenGL benchmark utility that can be used to test your performance hacks (if it doesn't kill your system).

Comment

Oops, I totally missed the "return" inside the if block; OK, you don't touch the ring twice if the shortcut is taken.
Anyway, what I see here is that GCC schedules the write to the ring (using a temp register for the index) between the increment and the masking, trying to fill the pipeline.
One difference is that in your fast path the index becomes an immediate; I tend to be wary of such optimizations (open-coding) though: it's far too easy to introduce bugs when the open-coded parts are not kept in sync.
One thing that could be tried is moving the test for the wrap into BEGIN_RING and setting the mask to ~0 if it's not needed; GCC seems to be smart enough to skip the AND in this case.
The side effect is that if you pass the wrong number of words to BEGIN_RING, you end up writing past the ring...
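
A rough sketch of that BEGIN_RING idea, with hypothetical names (ring_begin, ring_out) and a made-up 8-word ring rather than the driver's real macros:

```c
#include <assert.h>
#include <stdint.h>

#define RING_WORDS 8u                  /* hypothetical size, must be a power of two */
#define RING_MASK  (RING_WORDS - 1u)

struct ring {
    uint32_t buf[RING_WORDS];
    uint32_t tail;                     /* next write index */
    uint32_t mask;                     /* mask ring_out() applies for the current batch */
};

/* Stand-in for BEGIN_RING: decide once per batch whether a wrap is possible.
 * If n words fit before the end of the ring, the mask becomes ~0 (a no-op). */
static void ring_begin(struct ring *r, uint32_t n)
{
    assert(n <= RING_WORDS);
    r->mask = (r->tail + n < RING_WORDS) ? ~0u : RING_MASK;
}

/* Stand-in for OUT_RING: same shape as before, but with the per-batch mask.
 * If the caller emits more words than it announced to ring_begin(), the
 * no-op mask lets tail run past the end of buf. */
static void ring_out(struct ring *r, uint32_t v)
{
    r->buf[r->tail] = v;
    r->tail = (r->tail + 1u) & r->mask;
}
```

In this form the mask is a runtime value, so the AND is still executed; the compiler can only drop it where it can prove the mask is ~0, for example after inlining both calls into a branch where the test is known. The sketch only shows the bookkeeping, not that optimization.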

Comment

Well, no matter how GCC tries to hide the mask op, it's still at least 4 times slower than my optimized code.

Yes, it's ugly and hard to maintain, but it can't be done faster in C. I'm almost willing to bet beer that this will not make it into the kernel (I would bet beer, but I know that they will patch it into the kernel to win the bet and then patch it out), but that's why I call it a hack.

NOTE: I have discovered that the x11perf test is not a very good test case, as it can fluctuate by 5%. My patches do, however, give a noticeable improvement to 3D games that bog down a system.

Comment

I'm not convinced it's the AND; ignore it for a moment: with consecutive OUT_RINGs the CPU still needs to compute the next index into the ring before actually writing into it, so it's possible that a mov into the ring is stalled by the inc of the index.
With open-coded offsets, on the other hand, the index is known at compile time and the compiler emits the movs back to back.