Introduction

CUDA and OpenCL gives any application access to the graphics processing unit for non-graphical computing and enables massive general purpose computations using GPUs. Before the advent of CUDA and OpenCL, GPGPU is done using graphics API shaders (HLSL and GLSL) which are awkward and unnatural to the task at hand because programmer has to 'bend' the shaders to do general purpose programming. CUDA is available for use on NVidia graphics card and x86 CPU. OpenCL is available on NVidia and AMD graphics card, x86 CPU and ARM CPU. Whereas DirectCompute is only available on MS Vista and Windows 7 platforms and DirectX 10/11 level graphics card. To read this article, reader is expected to be fairly acquainted with CUDA and OpenCL workings, its memory model and technical terms like kernel, host and device.

Algorithm

The algorithm used in this article is basically the same as my Permutation in C++, Part 2 article. The algorithm make use of factorial decomposition to find lexicographical permutations and is zero based. As the reader can see from the rating of my previous article, I am not very good at explaining this algorithm and reader may want to read the explanation elsewhere. To explain this algorithm, we will use 2 sets of elements, the first set is known as the unpicked set which will initially store all the elements and the second one is called the picked set which is initially empty but will eventually contain the final permutation. Element move one by one from unpicked set to the picked set. Let us walk through an example. Say we have a unpicked set of {0,1,2}. By computing the factorial of 3, we know that 3 elements have a total of 6 different permutations. Note: factorial operation is usually notated by ! sign, for example factorial of 2 is simply written as 2!. Factorial of 1 is defined to be 1.

0,1,20,2,11,0,21,2,02,0,12,1,0

From the list of (zero-based) lexicographical permutations, we know that the first permutation(0th) is 0,1,2 and the last permutation(5th) is 2,1,0. Let us find out about 4th permutation (which we know from the above list, is 2,0,1) using factorial decomposition.

2! * 0 = 0
2! * 1 = 2
2! * 2 = 4

The algorithm requires us to find highest product which is greater or equal to our nth index which is 4 so we know the first element is 2 because 4 is equal to the 3rd product and 3rd element in the unpicked set of {0,1,2} is 2. Let us proceed to find the 2nd element. Before that, we must subtract the found product from our nth index which is 4 - 4 = 0. {2,?,?} is our current pciked set with {0,1} remaining in the unpicked set.

1! * 0 = 0
1! * 1 = 1

And again, we find the highest number which is greater or equal to our subtracted nth index which is 0 so we know the second element is 0 because 0 is equal to the 1st product and 1st element of the unpicked set {0,1} is 0. {2,0,?} is our current partially found permutation with {1} remaining in the unpicked set. We need not compute any further because only 1 is remaining in the unpicked set. So the 4th permutation is {2,0,1}. Let's try another example! This time, we will find the 3th permutation of 3 elements {0,1,2}.

2! * 0 = 0
2! * 1 = 2
2! * 2 = 4

As the reader should have know by now, we need to find highest product which is greater or equal to our nth index which is 3. We know the first element is 1 because 3 is less than the 3rd product but more than the 2nd product and 2nd element of the unpicked set {0,1,2} is 1. Let us proceed to find the 2nd element of the picked set. First, we must subtract the found product from our nth index which is 3 - 2 = 1. Note: {1,?,?} is our current picked set with {0,2} remaining in the unpicked set.

1! * 0 = 0
1! * 1 = 1

And again, we find the highest number which is greater or equal to our subtracted nth index which is 1 so we know the second element is 2 because 2 is 2nd element of the unpicked set {0,2}. {1,2,?} is our current partially found permutation with {0} remaining in the unpicked set. Because only 0 is remaining in the unpicked set, so the 3th permutation is {1,2,0}.

CUDA Source Code

I present the source code for the CUDA kernel. The kernel function is called Permute. The arrDest parameter is an uninitialized array which will be filled with the results after the CUDA operation is completed. offset is the nth permutation to be start computing and Max is the maximum permutation to compute. As a simple example to illustrate the use of offset and Max parameters, let assume you had 1000 permutations to find but your GPU only have enough memory to find 500 permutations at any one time, so you have to split the CUDA operation into 2. The first operation compute the 0th to 499th permuations (offset=0 and Max=500) while the second operation compute the 500th to 999th permutations (offset=500 and Max=500). The nth permutation used in the algorithm is zero-based. The constant array, arr[i][j] stores the precomputed value of factorial(i) * j where i and j are both zero-based and factorial(0) is assumed to be zero instead of the usual 1 in order to get the code working correctly. Constant memory access in CUDA is faster than global memory but they are usually limited to 64KB and cannot be written to. When we are finding kth element out of a set of n, we are using the factorial of (n-k). Each element of arr is a 64 bit integer which can store up to maximum factorial of 20. The next_permutation is copied and modified from Visual C++ 10 STL. NEXT_PERM_LOOP specifies the number of times you want to use next_permutation.

To control operation parameters, we need to set macros accordingly. As an example, the macros below find permutations for 5 elements from 0th to 29th permutations. If the kernel cannot be run, the reader might want to play around with the values of BLOCK_DIM in multiple of 256.

I have included a check function to check the generated permutations against next_permutation and a display function to display the the results. Reader might want to comment out display function for massive number of permutations.

OpenCL Source Code

The OpenCL kernel function is presented below. The OpenCL code is similar to the CUDA one except the host code and kernel code are written in 2 separate source files. This leads to the need to set the NUMBER_OF_ELEMENTS and NEXT_PERM_LOOP macro twice (once in the host file and once in kernel source file). arr is a single dimensional array as opposed to two dimensional array in CUDA because it is not possible to define 2 dimensional array in OpenCL.

Benchmark

I have benchmarked the CUDA against the CPU application. The CPU and GPU used in the benchmark, are Intel i7 870 (8 cores), 2.93Ghz and NVidia Geforce 460 respectively. The CPU application make full use of the 8 cores to find permutations. The CPU application is using factorial decomposition to split the nth permutations among different CPU cores and in each worker threads, STL next_permutation is used to find every consecutive permutation from the first nth permutation. The results of computing permutations of 11 elements is listed below. The total number of permutations of 11 elements found is 39,916,800. The size of the array need to store the results is 39,916,800 x 11 = 439,084,800. This is the maximum number of permutations which my 1GB memory GPU can store.

The benchmark of 8 core CPU and the GPU is about the same. Does that mean the GPU permutation is useless? GPU permutation by no means is useless because it enables permutation and your custom kernel to be done wholly on GPU, instead of the CPU/GPU hybrid approach. Factorial decomposition has complexity of O(n^2) while STL
next_permutation has complexity of O(n) but we cannot use purely next_permutation on GPU because next_permutation depends on the current permutation to generate the next one. However, factorial decomposition can be used in conjunction with next_permutation on GPU to improve performance (similar to what I have done with the CPU benchmark). Higher number of use of next_permutation seem to have diminishing returns on performance as the number of workitems have to be massive to keep the GPU work units busy; 300ms is the lowest for 2, 3 or 4
next_permutation after which the performance will worsen. Version 1 with pure factorial decomposition is simple to be modified to be called from your custom kernel. Imagine you are using version 3 which can gives you flexibility of different number of next_permutation in 1 kernel call, and if your custom kernel only needs 1 permutation input, what are the programmer going to do with the rest of permutations generated by next_permutation?

Massive Permutations Benchmark

I decide to modify the code to benchmark finding permutations of 12 elements. Total permutation is 479,001,600. Total bytes needed to store all permutation is 479,001,600 * 12 = 5,748,019,200 (around 6GB) which is way too much to store in my 1GB Graphics card. I break up the computation in batches of 12 to fit into my graphics card memory. The number you choose to break them up must be a factor of total permutations, meaning the total permutations could be cleanly divided by this number without remainder. The 1st batch is computed using factorial decomposition while the rest is using pure next_permutation. The results is below.

The 1st timing using factorial decomposition, is 649ms. The total timing is 905ms. The CPU version takes 6.5 seconds. The GPU performance is improved over CPU version by 7x! The reader can find this benchmark source code under PermCudaBenchmark folder. The readers may wish to comment out the checking function which takes a long time to check all the generated permutations.

Limitations

The presented algorithm can only find permutations for a maximum of 20 elements. This is because 64 bit integer can only store at most up to factorial(20). Note: A graphics card with 1 GB can store permutations of 11 elements (which is about 0.5 GB). Permuting 12 or more elements, the programmer have to break up the operation into batches (Preferably the 1st operation is using factorial decomposition while the rest is using next_permutation). As indicated in this wiki doc, oclHashcat-plus permutation limit is 11. For those who are not familiar, oclHashcat-plus is the advanced password recovery software running on GPU. Also this algorithm does not handle duplicate elements. A note of caution: the OpenCL code is only tested on NVidia graphics card, AMD users have to enable the check function (enabled by default) to ensure the permutations are correct. As implied, OpenCL code is not tested on any x86 OpenCL implementation.

Points of Interest

How did my affection for permutation came about? The history started in 1998, there was this prize game held by the local radio station, FM 93.3. The radio will give out 6 digits and the radio listeners had to arrange these 6 digits into 2 numbers with 3 digits, and these 2 numbers have to add up to 933. The cash prize was very attractive and it accumulated each time a listener fails to get the correct answer. I took the brute force approach to solve the problem. It took me a few days to get the permutations working in Visual Basic 6. The funny thing was I never managed to call through the radio station hotline to give my answer. I listened with helplessness, as the $1000+ cash prize flew away from me.

Conclusion

In this article, we have look at how to implement generation of lexicographical permutations on GPU, using NVidia's CUDA. The benchmark seem to indicate both CPU and GPU have similar performance. We have improved the performance by 40% using hybrid approach of factorial decomposition and next_permutation. GPU has the advantage of running your custom operations mainly using GPU, instead of hybrid CPU/GPU approach if you have computed permutation on the CPU beforehand. To build and run the CUDA source code, reader must download the latest CUDA SDK. The source code is hosted at Codeplex. Any comments on how to further optimize the CUDA and OpenCL code is most welcome.

OFFSET is a defined constant literal. When OFFSET is defined as 0, the if condition will become if(0!=0) which is always false and the compiler will optimize away that block of code. That is the main purpose of if condition.

Wong, I don't pretend to be an expert in C++ but I found the concept of the article more readable than the code itself!

The main purpose of my post is to give you the heads-up of an excellent piece of code written by James McCaffrey. He deals with permutations from a different angle. I thought I'd let you know about his work on the subject as I'm sure you'll find it very worthwhile reading.

I admit the code section should have more explanation. I am still optimizing the code which I managed to shave off 40% of the original timing yesterday with the use of next_permutation. When the source code changes have stabilized, I'll improve explanation on the code workings and add more comments to the source code. Thanks for the link, I have read that article before I wrote my 2nd permutation article.