>mason+@transarc.com (Tony Mason) writes:>>[quoting from Dr. Dobbs]>>[Old optimization rules fail on the 486, where, a move to or from memory>> takes one cycle, but exchanging two registers takes three cycles.]

John Levine (compilers-request) writes:>[Most uses I have seen are for atomic operations.]

It is certainly the case that exchange is used for atomic operations.
But that is because there are no `ordinary' instructions that will do
the trick. The other half of the question is ``what non-atomic
operations can make use of the exchange instruction?'' I have an
answer in two parts: First, example code that uses an abstract
exchange operation. Second, details of the the iX86 `xchg' operation.

* Example Code

Here is code that the compiler could recognize as using exchange. I
saw this code yesterday in a data compression algorithm:

t := prev_val;
prev_val := a[i];
a[i] := t;

A reasonable implementation could make use of exchange, and a simple
peephole optimizer could replace instances of

mov rx, ry
ld rx, addr
st ry addr

with

xchg rx, addr

(as long as `ry' is not used subsequently). I can imagine a register
allocator that instead of doing spills and restores with

mov r0, -46[fp]
mov -50[fp], r0

figures out the lifetimes of the variables, figures out that they're
disjoint (after all, the stack slots got generated in the first place
becase there weren't enough empty registers...), creates one merged
stack slot, and replaces the above with

xchg r0, -44[fp]

This has advantages of (a) smaller code size and (b) better stack
locality (decreasing the stack size and increasing the utilization of
particular stack slots). Note that no analysis of user code is needed
for this latter operation, the spills and restores are
compiler-generated.

* iX86 `XCHG'

The iX86 has a `lock' prefix that can be applied to several
instructions. When those instructions are `lock'-prefixed, they are
guaranteed to behave atomically wrt memory. On a multiprocessor, if
two processors simultaneously execute

lock incl 4[r0]

then the value at 4[r0] is guaranteed to be incremented by 2 (one for
each processor). The `xchg' instruction can have the `lock' prefix
applied to it but -- for reasons I know not what -- omitting the
`lock' prefix has exactly the same effect, namely the exchange
operation is atomic whether or not you use the `lock' prefix.

Thus, on the i486 reads and writes that hit in the cache never go
off-chip. Except that all `xchg' instructions must assert a line off
chip saying ``don't touch (this) memory right now''. The 3-cycle
estimate is optmistic -- on a multiprocessor with a heavily-loaded
bus, atomic bus operations take substantially longer.

To summarize: it isn't hard to think of cases when exchange could be
useful, but you don't want to do it on the iX86 family because the
exchange instruction is defined as having atomic semantics that must
be enforced at any cost.