A 1.07 Tbit/s 128

Single Instruction Multiple Data (SIMD) engines are becoming common in modern processors to handle computationally intensive applications such as video and image processing. Such processors require swizzle networks to permute data between compute stages. Existing circuit topologies for these networks do not scale well: the number of control signals grows rapidly with network size, imposing significant area and energy overhead and limiting the number of processing units in a SIMD engine. Worsening interconnect delays in scaled technologies aggravate the problem. To mitigate this, the authors propose a new interconnect topology, called XRAM, that reuses the output buses for programming and stores shuffle configurations in SRAM cells at the cross points, significantly reducing routing congestion, lowering area and power, and improving performance.
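
The idea of keeping shuffle configurations at the cross points can be illustrated with a small behavioral sketch. This is a toy software model, not the authors' circuit: the class name `CrossbarModel`, the methods `program` and `shuffle`, the port count, and the number of stored configurations are all hypothetical. It also assumes that programming is done by placing, on each output bus, the index of the input that should drive that output; the abstract states only that the output buses are reused for programming.

```python
class CrossbarModel:
    """Toy model of a crossbar whose routing state is stored at the cross points.

    Hypothetical names and encoding; a sketch under stated assumptions only.
    """

    def __init__(self, ports, num_configs):
        self.ports = ports
        # cells[k][i][j] is True when, in stored configuration k,
        # input i is connected to output j (one bit per cross point).
        self.cells = [[[False] * ports for _ in range(ports)]
                      for _ in range(num_configs)]

    def program(self, slot, bus_values):
        """Store one shuffle pattern in configuration slot `slot`.

        Assumption: bus_values[j] is the value placed on output bus j,
        interpreted as the index of the input that should drive output j.
        """
        if len(bus_values) != self.ports:
            raise ValueError("one value per output bus is required")
        for j, src in enumerate(bus_values):
            if not 0 <= src < self.ports:
                raise ValueError("source index out of range")
            for i in range(self.ports):
                # Only the cross point on row `src` of column j is set.
                self.cells[slot][i][j] = (i == src)

    def shuffle(self, slot, data):
        """Route `data` through the pattern stored in `slot`."""
        out = [None] * self.ports
        for j in range(self.ports):
            for i in range(self.ports):
                if self.cells[slot][i][j]:
                    out[j] = data[i]
        return out


# Usage: store two patterns once, then switch between them by slot number.
xbar = CrossbarModel(ports=4, num_configs=2)
xbar.program(0, [3, 2, 1, 0])   # slot 0: reverse the lanes
xbar.program(1, [0, 0, 0, 0])   # slot 1: broadcast lane 0 to every output
reversed_lanes = xbar.shuffle(0, ["a", "b", "c", "d"])
broadcast_lanes = xbar.shuffle(1, ["a", "b", "c", "d"])
```

In this model, selecting a shuffle at run time needs only a slot number rather than a separate control signal for every cross point, which is the kind of routing reduction the abstract describes.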