But I want to pass a WIDTH*HEIGHT array to an existing C library that expects a single plane of this data. That would be a sequence of just the R values (or just the G, or just the B).

It's easy enough to allocate a new array and copy the data (duh). But the images are very large. If it weren't a C library but took some kind of iteration interface to finesse the "slicing" traversal, that would be great. But I can't edit the code I'm calling...it wants a plain old pointer to a block of sequential memory.

HOWEVER I have write access to this array. It is viable to create a routine that would sort it into color planes. I'd also need a reverse transformation that would put it back, but by definition the same method that sorted it into planes could be applied to unsort it.
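For reference, the "allocate and copy" approach dismissed above looks something like this (a sketch only; `copy_plane` and its parameters are illustrative names, and the extra WIDTH*HEIGHT allocation is exactly the cost being avoided):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive plane extraction: copy every 3rd byte of the interleaved
// RGBRGB... buffer into a freshly allocated plane. Costs an extra
// WIDTH*HEIGHT bytes of memory.
std::vector<uint8_t> copy_plane(const uint8_t* interleaved,
                                std::size_t n_pixels,
                                std::size_t channel) {  // 0=R, 1=G, 2=B
    std::vector<uint8_t> plane(n_pixels);
    for (std::size_t i = 0; i < n_pixels; ++i)
        plane[i] = interleaved[i * 3 + channel];
    return plane;
}
```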

You are talking about transposing a 3xN matrix. The naive transpose operation is inefficient because it's full of cache misses. Google for "cache-efficient transpose".
– Oliver Charlesworth, Dec 11 '11 at 17:36

@DeadMG C++. I actually think I understand the problem pretty well (including the trick that scrambling the data to get just one of the planes sorted, with reversibility as the only requirement, can be a trivial operation...which is why I'm only discussing the harder problem). I'm just contemplating clever alternatives, such as those which would require (say) just WIDTH bytes or just HEIGHT bytes of overhead.
– HostileFork, Dec 11 '11 at 19:53

Or use an algorithm which writes each value to its transposed place, then does the same for the value displaced from that place, and so on until the cycle closes. Flag processed values in a bit vector, and continue until this vector is all 1s.
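A minimal sketch of that cycle-following idea, assuming the interleaved layout is RGBRGB... with `n_pixels` pixels (the function and variable names here are mine, not from any library). The bit vector costs n_pixels*3/8 bytes of overhead:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// In-place conversion of interleaved RGBRGB... into planar
// RRR...GGG...BBB... by following the permutation's cycles.
void interleaved_to_planar(uint8_t* data, std::size_t n_pixels) {
    const std::size_t total = n_pixels * 3;
    std::vector<bool> done(total, false);   // 1 bit per element
    for (std::size_t start = 0; start < total; ++start) {
        if (done[start]) continue;
        std::size_t src = start;
        uint8_t carried = data[src];        // value currently in hand
        do {
            // Element at interleaved index i belongs at planar index
            // (i % 3) * n_pixels + i / 3.
            std::size_t dst = (src % 3) * n_pixels + src / 3;
            std::swap(carried, data[dst]);  // place value, pick up next
            done[dst] = true;
            src = dst;
        } while (src != start);             // cycle closed
    }
}
```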

Neither algorithm is cache-friendly. Clever use of the PREFETCH instruction can probably improve this.

My original intent was to use a bit vector of size WIDTH*HEIGHT, which gives an overhead of WIDTH*HEIGHT/8 bytes. But it is always possible to sacrifice speed for space: the bit vector may instead be of size WIDTH, or HEIGHT, or any desired value, even 0. The trick is to maintain a pointer to the cell before which all values are already transposed; the bit vector covers only the cells starting from this pointer. Once the vector is all 1s, the pointer is moved to the next position, and all the algorithm's steps are performed except the actual data movement, leaving the bit vector ready to continue transposing. This variant is O(N^2) instead of O(N).

Edit2:

The PREFETCH optimization is not difficult to implement: calculate the indexes, invoke PREFETCH on them, and put the indexes into a queue (a ring buffer); then take indexes from the queue and move the data.
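One way that index-queue idea could look, sketched for a single cycle of the transpose permutation described earlier. This assumes GCC/Clang's `__builtin_prefetch`; `next_index`, `kAhead`, and the function names are mine, and the distance of 8 is an untuned guess:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// The in-place transpose permutation: interleaved index -> planar index.
inline std::size_t next_index(std::size_t i, std::size_t n_pixels) {
    return (i % 3) * n_pixels + i / 3;
}

// Walk one permutation cycle starting at 'start', moving bytes into
// place while prefetching the cell we will touch kAhead steps later.
// The ring buffer holds the upcoming indexes so each is computed once.
void move_cycle_with_prefetch(uint8_t* data, std::size_t n_pixels,
                              std::size_t start) {
    constexpr std::size_t kAhead = 8;       // prefetch distance (tune)
    std::size_t ring[kAhead];               // queue of upcoming indexes
    std::size_t ahead = start;
    for (std::size_t k = 0; k < kAhead; ++k) {
        ahead = next_index(ahead, n_pixels);
        __builtin_prefetch(&data[ahead]);   // warm the cache early
        ring[k] = ahead;
    }
    uint8_t carried = data[start];
    std::size_t pos = 0;                    // ring read/write position
    std::size_t dst;
    do {
        dst = ring[pos];                    // dequeue next destination
        ahead = next_index(ahead, n_pixels);
        __builtin_prefetch(&data[ahead]);   // enqueue + prefetch ahead
        ring[pos] = ahead;
        pos = (pos + 1) % kAhead;
        std::swap(carried, data[dst]);      // the actual data movement
    } while (dst != start);                 // cycle closed
}
```

Prefetching past the end of a short cycle is harmless here, since the permutation only ever maps indexes back into the array.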

Edit3:

The idea of another algorithm, which is O(1) in space and O(N*log(N)) in time, is cache-friendly and may be faster than the "cycle" algorithm (for image sizes < 1 GB):

Split the N*3 matrix into several 3*3 matrices of char and transpose them

Split the result into 3*3 matrices of char[3] and transpose them

Continue while the matrix size is less than the array size

Now we have up to 3*2*log3(N) ordered pieces. Join them.

First join pieces of equal size; very simple "cycles" of length 4 may be used.
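The first step of this scheme can be sketched as follows, assuming for simplicity that N is a multiple of 3 (a real implementation must handle the remainder; the function name is mine). After it runs, every group of 3 pixels is locally planar:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Step 1: view the N*3 interleaved array as N/3 consecutive 3x3 byte
// matrices and transpose each one in place, swapping across the diagonal.
// Three pixels r0g0b0 r1g1b1 r2g2b2 become r0r1r2 g0g1g2 b0b1b2.
void transpose_3x3_blocks(uint8_t* data, std::size_t n_pixels) {
    for (std::size_t b = 0; b + 3 <= n_pixels; b += 3) {
        uint8_t* m = data + b * 3;          // one 3x3 block, row-major
        std::swap(m[1], m[3]);              // (0,1) <-> (1,0)
        std::swap(m[2], m[6]);              // (0,2) <-> (2,0)
        std::swap(m[5], m[7]);              // (1,2) <-> (2,1)
    }
}
```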

+1 thanks for the paper. Although "in-shuffle" seems like a strange thing to call it...I would generally say that "shuffling" is understood to be a randomizing operation. Terminologically speaking, I think you and @OliCharlesworth have nailed it better by categorizing it as a matrix transposition. I'd be interested to see a "cycle-processing" version with a bit vector that actually worked, everything I've considered in that area has been a dead end.
– HostileFork, Dec 11 '11 at 19:57