This involves 4 global memory operations: 2 reads and 2 writes (and 2 register operations, I think). Can't the same be done in one "exchange" operation on the global memory?

The point is that read&write operations transfer the data between the memory and a processing unit (involving the cache and polluting it). However, in a swap operation the processing unit (CPU or GPU) doesn't need the data: swap operation could be performed internally in the memory chips.

Whatever the communication path could be between memory chips, it is shorter than communication with the processing unit. The processor (CPU or GPU) gets involved, because if those memory addresses are cached, it needs to swap their values in its cache. It is just that the processing unit doesn't need to read and write the data from&to the global memory: it just needs to send a command to the global memory.

Because current computer programs are mostly memory speed bound, and swap operations are not rare, such an addition to the instruction set architecture should improve performance a lot: the number of memory-CPU or memory-GPU transfers is reduced 4 times (2 reads and 2 writes vs. 1 command to exchange).

Ideally you have to optimize the program to be able to read/write global mem in large bursts and inbetween do whathewer in more 'agile' and faster but smaller memory (local mem, cache, registers). You can even swap data by 'renaming' registers and with GCN3 you can swap data with neighbouring threads too using 0 overhead (with the DPP extension).

In global mem you can read only larger aligned blocks. The blocksize on CPU is like 16byte, and on GPU it's like 256byte. These are the smallest data that the memory interface can work and the more precise work is only realized by later stages in the system as the write combining caches, the LDS crossbar and the execution units.

Also one more thing: The main memory on a GPU works in the background so for the particular swap example, it would be a lot more efficient if you collect 64 of these swap tasks, do the reads, and when you do the 64 writes you can initiate the next batch of 64 reads and so on. Sure your program will be more complicated and difficult to maintain but the thing with GPU-s isn't going in a direction to [make them easier to use] but to [make them faster when properly used].

Wouldn't then DMA feature look like "a waste of die space" following such logic? DMA controller allows memory operations without CPU and CPU Cache involved. Currently, AFAIK, DMA controller only allows copying from/to devices, including GPUs. Swap is a kind of copying, just that it can happen inside memory chips, or at least in the DMA controller, but my point is that there is no need to involve CPU and pollute CPU cache, because in a swap operation CPU doesn't really need to know what data is at those addresses - it just needs the data swapped between addresses.

(btw: the GDS swap instruction is pretty close to it: It works asynchronously and swaps memory contents with register contents. Usually if it involves registers, there are many swap related things: like the 16byte permute thing in SSE4, and the packer/unpacker instructions in SSE2. These are rearranging data really fast. If the data manipulation occurring in the slow ram, the the cpu will wait for the ram most of the time, don't need to make that easier with new instructions. Instead you can combine the manual swapping with some other tasks to use instruction level parallelism.)

You can find examples in LibSVM (search for "swap"): during SVM iterations it identifies array indices of the items it doesn't need in the near future, then to keep the active set of the array dense it swaps the items it doesn't need with the tail of the array, reducing the array size. So here is at least 1 out of each 2 items swapped is not needed in the near future (meaning no need for it in the CPU cache). Furthermore, LibSVM also swaps items in its software cache lines, and for those items even both swapped items may not be needed in the near future. I think (semi-)scientific algorithms can give many more examples, such as rebalancing a balanced tree (actively used in C++ STL set/map containers), hash-table algorithms, priority queue algorithms, etc.