Longevity of xor eax,eax on pII

Hi,

Lots has been said about the benefits of using

xor eax,eax

or some other way of zeroing a 32-bit register to avoid partial register stalls on the PII.

At what point does the PII suddenly decide that the reg in question is now (not) being accessed in a partial manner?

If I access it only by byte parts, i.e. only ah and al, is that OK, or... what? When do I start incurring penalties again?

Jim

Tue, 10 Jul 2001 03:00:00 GMT

Terje Mathise#2 / 12

Quote:

> > Hi,

> > Lots has been said about the benefits of using

> > xor eax,eax

> > or some other way of zeroing a 32-bit register to avoid partial register stalls on the PII.

> > At what point does the PII suddenly decide that the reg in question is now (not) being accessed in a partial manner?

> > If I access it only by byte parts, i.e. only ah and al, is that OK, or... what? When do I start incurring penalties again?

> As far as I understand, it works like this. All registers have a "dirty" flag associated with them. If you perform a XOR reg32,reg32 (and possibly a SUB reg32,reg32) this internal marker gets cleared. If you write to a partial register (word or byte access) and the flag is clear, the processor "knows" that it can just zero-extend the data and write the whole 32 bits of the register. If the flag is set, on the other hand, the processor "knows" it needs to merge your new data with old data from the remaining parts of the register, and a stall occurs while the merge is in progress. Any instruction writing a register or parts of it (except of course XOR itself) causes the register to be marked as dirty.

> I am sure Agner Fog (or Vesa, or Terje) will set me straight if I explained this incorrectly.

No, we won't because your explanation is exactly right.

The only key issue you left out is the fact that these 'dirty' bits are not a part of the visible processor state, i.e. there is no way to save/restore them across an interrupt.

This means that the PRS-avoiding hack only works when the distance (in time) between the XOR EAX,EAX and the MOV AL,data/use of EAX is so short that there hasn't been any kind of interrupt in the meantime.

Andy Glew once told me that this very gotcha had caught some Intel:

The key register(s) were XOR'ed outside a loop with a very high iteration count, so most of the time spent in the loop came after the first external interrupt. This made the loop run _much_ slower than it should have, since every partial access/full-width use combination generated another PRS.
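
A toy model in C of the scenario above, to make the cost concrete. It assumes, as the anecdote implies, that the hidden flag survives repeated low-byte writes and is lost only when an interrupt arrives. The names `count_stalls_hoisted`, `count_stalls_per_iteration` and `interrupt_at` are made up for this sketch, and nothing here is cycle-accurate.

```c
#include <stdbool.h>

/* Toy model only: one hidden "upper bits known zero" flag for EAX.
 * Assumption: the flag is set by XOR EAX,EAX, survives MOV AL,data,
 * and is lost when an interrupt arrives (it cannot be saved/restored).
 * Interrupts are assumed to arrive between loop iterations. */

/* XOR EAX,EAX executed once, before the loop. */
long count_stalls_hoisted(long iterations, long interrupt_at)
{
    bool clean = true;              /* set by the single XOR before the loop */
    long stalls = 0;

    for (long i = 0; i < iterations; i++) {
        if (i == interrupt_at)
            clean = false;          /* hidden flag lost across the interrupt */
        /* MOV AL,data followed by a full-width use of EAX */
        if (!clean)
            stalls++;               /* merge needed: counted as one PRS */
    }
    return stalls;
}

/* XOR EAX,EAX repeated inside the loop, right before each MOV AL,data. */
long count_stalls_per_iteration(long iterations, long interrupt_at)
{
    bool clean = true;
    long stalls = 0;

    for (long i = 0; i < iterations; i++) {
        if (i == interrupt_at)
            clean = false;          /* flag lost, as above */
        clean = true;               /* XOR EAX,EAX re-establishes it */
        /* MOV AL,data followed by a full-width use of EAX */
        if (!clean)
            stalls++;
    }
    return stalls;
}
```

In this model, a million iterations with an interrupt after the first thousand give 999,000 stalls for the hoisted XOR and none when the XOR is repeated in every iteration.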

I have actually seen behaviour like this in code which really shouldn't have been able to run long enough to give a noticeable chance of an external interrupt: this code ran exactly twice as fast after I made two separate versions of it, using the original for Pentium and lower, and the new one for PPro+.

The PPro version used MOVZX exclusively, and it ran very close to the theoretical speed.
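
In C terms, the idioms discussed here compute the same architectural value, and only the hardware cost differs. The sketch below is illustrative: the function names are invented, and the assembly named in each comment is the instruction sequence the function stands for, not what a compiler is guaranteed to emit.

```c
#include <stdint.h>

/* MOVZX EAX,data: a single instruction that writes all 32 bits. */
uint32_t load_movzx(uint8_t data)
{
    return (uint32_t)data;
}

/* XOR EAX,EAX ; MOV AL,data: zero the register, then write the low byte. */
uint32_t load_xor_then_byte(uint8_t data)
{
    uint32_t eax = 0;                   /* XOR EAX,EAX */
    eax = (eax & 0xFFFFFF00u) | data;   /* MOV AL,data */
    return eax;
}

/* MOV AL,data into a register with unknown old contents: the upper 24 bits
 * have to be merged with the new byte before EAX can be read as a whole. */
uint32_t load_byte_merge(uint32_t old_eax, uint8_t data)
{
    return (old_eax & 0xFFFFFF00u) | data;
}
```

The first two functions always return the same value; the third shows the merge that a later full-width read has to wait for when the register was not zeroed first.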

Terje

--

Using self-discipline, see http://www.eiffel.com/discipline "almost all programming can be viewed as an exercise in caching"

Fri, 13 Jul 2001 03:00:00 GMT

James Sh#3 / 12

On 25 Jan 1999 15:43:25 GMT, Terje Mathisen

Quote:

>> > Hi,

>> > Lots has been said about the benefits of using

>> > xor eax,eax

Thanks Terje et al.

Basically, the registers stay clean for as little time as possible. Even an interrupt can affect the current status of the registers.

If I use al, and then access ah, I still suffer a stall, because eax now sees that it has to combine the bits together.

Arse. I'll be re-coding my clip flags routine again tomorrow.

Scenario: I have a routine which clips homogeneous coordinates, i.e. it has to test

x > z
x < -z
y > z
y < -z

I do this ATM with

xor eax,eax

cmp foo,bar
sets ah
or al,ah
add al,al
..repeat for each clip plane.

This has the (massive) advantage of having no jumps, which kill the PII worse than PRS does, mainly because there is no way of branch predicting this sort of code.

It would probably be better to move the flags into a new register, but at a guess from what you say, the 'dirty' flag will be set if I adc, shl etc. on any register.
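
A hedged sketch in C of the same test with the flags accumulated in a full-width integer, so that no byte register is written and then read back as 32 bits. The bit assignments and the name `clip_code` are assumptions for illustration, and a compiler may or may not turn the comparisons into branch-free SETcc/MOVZX sequences.

```c
/* Build a 4-bit clip code for one homogeneous coordinate.
 * Bit layout (an assumption of this sketch):
 *   bit 0: x >  z
 *   bit 1: x < -z
 *   bit 2: y >  z
 *   bit 3: y < -z
 * Each comparison yields 0 or 1 as a full-width value, so the flags are
 * combined with ordinary 32-bit shifts and ORs and no explicit jumps. */
unsigned clip_code(float x, float y, float z)
{
    unsigned code = 0;

    code |= (unsigned)(x >  z) << 0;
    code |= (unsigned)(x < -z) << 1;
    code |= (unsigned)(y >  z) << 2;
    code |= (unsigned)(y < -z) << 3;
    return code;
}
```

A point is inside all four planes when `clip_code` returns 0; any non-zero result identifies which planes it is outside of.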

Jim

Sun, 15 Jul 2001 03:00:00 GMT

Paul Hsi#4 / 12

Quote:

> > As far as I understand, it works like this. All registers have a "dirty" flag associated with them. If you perform a XOR reg32,reg32 (and possibly a SUB reg32,reg32) this internal marker gets cleared. If you write to a partial register (word or byte access) and the flag is clear, the processor "knows" that it can just zero-extend the data and write the whole 32 bits of the register. If the flag is set, on the other hand, the processor "knows" it needs to merge your new data with old data from the remaining parts of the register, and a stall occurs while the merge is in progress. Any instruction writing a register or parts of it (except of course XOR itself) causes the register to be marked as dirty.

> > I am sure Agner Fog (or Vesa, or Terje) will set me straight if I explained this incorrectly.

> No, we won't because your explanation is exactly right.

> The only key issue you left out is the fact that these 'dirty' bits are not a part of the visible processor state, i.e. there is no way to save/restore them across an interrupt.

> This means that the PRS-avoiding hack only works when the distance (in time) between the XOR EAX,EAX and the MOV AL,data/use of EAX is so short that there hasn't been any kind of interrupt in the meantime.

> Andy Glew once told me that this very gotcha had caught some Intel:

> The key register(s) were XOR'ed outside a loop with a very high iteration count, so most of the time spent in the loop came after the first external interrupt. This made the loop run _much_ slower than it should have, since every partial access/full-width use combination generated another PRS.

Ahahahahahaha!!! I'm sorry but that is just too funny. I can just see Intel trying to explain this in their documentation. It's an interesting anomaly that perhaps Agner should add to his infamous guide.

Anyhow, as Agner indicates, if the register retires and leaves the "forwarding space", I think that the partial registers are collected, and I would assume that this flag is also turned off. What this suggests to me is that you can simply wait long enough before using the register again, and it will automatically be recollected back into a uniform register. I have not confirmed this myself, however.

-- Paul Hsieh

Sun, 22 Jul 2001 03:00:00 GMT

Terje Mathise#5 / 12

Quote:

> > The key register(s) were XOR'ed outside a loop with a very high iteration count, so most of the time spent in the loop came after the first external interrupt. This made the loop run _much_ slower than it should have, since every partial access/full-width use combination generated another PRS.

> Ahahahahahaha!!! I'm sorry but that is just too funny. I can just see Intel trying to explain this in their documentation. It's an interesting anomaly that perhaps Agner should add to his infamous guide.

> Anyhow, as Agner indicates, if the register retires and leaves the "forwarding space", I think that the partial registers are collected,

Yes, that's what retirement means: The renamed registers are written back to the 'true' architectural registers.

Quote:

> and I would assume that this flag is also turned off. What this suggests to me is that you can simply wait long enough before using the register again, and it will automatically be recollected back into a uniform register. I have not confirmed this myself, however.

That is correct, but not very useful for fast code:

With 2-3 microops/cycle and 10-20 cycles before retirement, you'd have to wait 20-60 instructions between writing the partial register and using the full reg.

As I said, not very useful for fast code. :-(

Terje

Sun, 22 Jul 2001 03:00:00 GMT

Jerry Coff#6 / 12

[ ... ]

Quote:

> Yes, that's what retirement means: The renamed registers are written back to the 'true' architectural registers.

Except for exceptions -- on a K6 (or close relative thereof) retirement is separate. Retirement only happens when all four micro-ops (oops, I mean RISC86 instructions) in an op-quad have had their results written back.

Quote:

> > and I would assume that this flag is also turned off. What this suggests to me is that you can simply wait long enough before using the register again, and it will automatically be recollected back into a uniform register. I have not confirmed this myself, however.

> That is correct, but not very useful for fast code:

> With 2-3 microops/cycle and 10-20 cycles before retirement, you'd have to wait 20-60 instructions between writing the partial register and using the full reg.

> As I said, not very useful for fast code. :-(

Depending -- in some cases, having 30 or 40 other instructions first is perfectly reasonable, especially if you can put a loop in between. AAMOF, it seems some basic ideas just keep coming back around -- this reminds me a lot of maximizing throughput with older floating point units by putting LOTS of instructions between starting the FPU doing something and putting the result to use. As I recall, on a 387, fsin (for example) could take over 500 cycles, which was typically around 100 to 150 integer instructions.

Of course, it IS really a pain to deal with this -- in general, the more other "stuff" you do between an instruction and using the result, the more difficult it makes maintenance later. Every instruction executed in the interim is an opportunity to mess things up...

Mon, 23 Jul 2001 03:00:00 GMT

James Sh#7 / 12

Quote:

>Ahahahahahaha!!! I'm sorry but that is just too funny. I can just see Intel trying to explain this in their documentation. It's an interesting anomaly that perhaps Agner should add to his infamous guide.

>Anyhow, as Agner indicates, if the register retires and leaves the "forwarding space", I think that the partial registers are collected, and I would assume that this flag is also turned off. What this suggests to me is that you can simply wait long enough before using the register again, and it will automatically be recollected back into a uniform register. I have not confirmed this myself, however.

Thanks guys for your help. It seems that there's little hope for *real* code to avoid PRS, especially under OS-controlled conditions. Haven't rethought my code yet though! It seems that it's not worth relying on the fact that _your_ code knows what PRS is, because something else in the OS has a good chance of screwing up your expectations. Still, I still save because of the lack of branches. It must be horrible for Intel to realise they screwed up so badly - PRS is deadly, but there seems to be no sensible way to avoid it over lengthy code sequences running on multitasking OSes.

Now, has anyone got any answers for debugging FPU code on the PII?

It's a nightmare. The CPU pipeline is so deep that fpu exceptions occur many cycles past when they would have on a 486. Usually the symptom is a crash on a docile instruction like float a = b * c, and I can print b and c to the terminal and they are ok. It's just that something a few dozen cycles ago screwed it up.

Thanks for any further help.

Jim

PS. Does anyone know where Agner Fog's page went? It used to be at www.announce.com/~agner but it has died.

Tue, 24 Jul 2001 03:00:00 GMT

Terje Mathise#8 / 12

Quote:

> PS. Does anyone know where agner fog's page went? It used to be at www.announce.com/~agner but it has died.

www.agner.org

Terje