Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

User's Guide

Document Number: 325262-002US
Intel® SDK for OpenCL* - Sample for Bitonic Sorting

Algorithm
For an array of length 2^N*4, this algorithm completes N stages of sorting. The first stage has one pass, in which the kernel forms bitonic sequences of size four using a SIMD sorting network inside each item of the input array. For each successive stage, the number of passes is incremented by one and the sequence size is doubled by merging two neighboring items. For general reference on bitonic sorting networks, see [1]. For reference on sorting networks using SIMD data types, see [2].

OpenCL* Implementation

Code Highlights
The bitonic sort OpenCL* kernel of the BitonicSort.cl file performs the specified pass of each stage. Every input array item or item pair (depending on the pass number) corresponds to a unique global ID that the kernel uses for identification. The full sorting sequence consists of repetitive kernel calls performed in the ExecuteSortKernel() function of the BitonicSort.cpp file.

Limitations
For the sake of simplicity, the current version of the sample requires an input array of 4*2^N 32-bit integer items, where N is a positive integer.
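The stage/pass structure described above can be sketched as a plain scalar loop. This is an illustrative model only, not the sample's actual code: the function names and the single-integer granularity are assumptions, and the real sample works on int4 quads and dispatches each pass as an OpenCL* kernel invocation.

```c
#include <stdbool.h>
#include <stddef.h>

/* One pass of a bitonic sorting network on a scalar array.
 * 'k' is the current sequence size (doubles each stage);
 * 'j' is the compare distance for this pass (halves each pass).
 * Each loop iteration plays the role of one kernel work-item. */
static void bitonic_pass(int *a, size_t n, size_t k, size_t j)
{
    for (size_t i = 0; i < n; ++i) {
        size_t partner = i ^ j;
        if (partner > i) {                  /* visit each pair once */
            bool ascending = ((i & k) == 0);
            if ((ascending && a[i] > a[partner]) ||
                (!ascending && a[i] < a[partner])) {
                int t = a[i]; a[i] = a[partner]; a[partner] = t;
            }
        }
    }
}

/* Full sort: for each stage the sequence size k doubles and the
 * number of passes grows by one, matching the description above.
 * 'n' must be a power of two. */
static void bitonic_sort(int *a, size_t n)
{
    for (size_t k = 2; k <= n; k <<= 1)          /* stages */
        for (size_t j = k >> 1; j > 0; j >>= 1)  /* passes  */
            bitonic_pass(a, n, k, j);
}
```

In the OpenCL* version, the body of the i loop becomes the kernel with i as the global ID; the host side (the role ExecuteSortKernel() plays) only re-enqueues the kernel with new stage and pass arguments.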

Work-Group Size Considerations
Valid work-group sizes on Intel platforms range from 1 to 1024 elements. To achieve peak performance, use work-groups of 64-128 elements.

Reference (Native) Implementation
The reference implementation is done in the ExecuteSortReference() routine of the BitonicSort.cpp file. It is single-threaded code that performs exactly the same bitonic sort sequence as the OpenCL* code, but uses a pure scalar C nested loop.

Project Structure
This sample project has the following structure:
• BitonicSort.cpp – the host code, with OpenCL* initialization and processing functions
• BitonicSort.cl – OpenCL* sorting kernel source code
• BitonicSort.vcproj – Microsoft Visual Studio* 2008 software project file containing all the required dependencies
• BitonicSort.vcxproj – Microsoft Visual Studio* 2010 software project file containing all the required dependencies

Understanding OpenCL* Performance Characteristics

Benefits of Using Vector Data Types
This sample implements the bitonic sort algorithm using vector data types, such as int4 or float4. Explicit usage of these types enables the following optimizations:
• You can work with quads instead of single integers. This removes unnecessary branches (decreasing execution overhead), saves memory bandwidth, and optimizes CPU cache usage.
• You can use a sorting network inside a single vector item during the last pass of every stage. This permits merging the two last passes to save an extra kernel invocation per stage.
Besides the maximum possible 4x speedup brought by SIMD register usage, these optimizations bring an additional 25% speedup to the explicitly vectorized version. As a result, you get approximately 5x speedup in total.
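The branch-removal benefit mentioned above can be illustrated with a small scalar analogue of a vectorized compare-exchange. The quad type and compare_exchange() helper below are hypothetical, not part of the sample; in OpenCL* C the same effect is achieved directly with min() and max() on int4 operands, which process all four lanes at once.

```c
/* Branchless compare-exchange: after the call, each lane of 'lo'
 * holds the smaller of the pair and each lane of 'hi' the larger,
 * with no data-dependent branches. */
typedef struct { int s[4]; } quad;  /* scalar stand-in for OpenCL int4 */

static void compare_exchange(quad *lo, quad *hi)
{
    for (int lane = 0; lane < 4; ++lane) {
        int a = lo->s[lane], b = hi->s[lane];
        int m = b ^ ((a ^ b) & -(a < b));   /* min(a, b), branchless */
        lo->s[lane] = m;
        hi->s[lane] = a ^ b ^ m;            /* max(a, b) */
    }
}
```

A compiler vectorizing this loop (or the equivalent int4 min()/max() in the kernel) sorts four element pairs per instruction, which is where the roughly 4x SIMD speedup cited above comes from.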