Very cool! It might be worth adding a note that CLArrays, at least, isn’t yet working on Julia 1.0, I think. I fired up a REPL to try to follow along but wasn’t able to: I don’t have an NVIDIA GPU, and ]add CLArrays failed to find a version compatible with Julia 1.0.

Pretty awesome stuff. One thing that isn’t clear: is writing a loop fast on the GPU, or does the operation have to be vectorized for it to be fast?

Basically, ‘sum’ will be fast, but is writing a loop that sums every element of the array just as fast? I can test this particular case, but are there principles that I can follow in general?
Also, how do I invoke the GPU’s sorting functions to sort a large vector?

Pretty awesome stuff. One thing that isn’t clear: is writing a loop fast on the GPU, or does the operation have to be vectorized for it to be fast?

Any function you write for the GPU will get executed in parallel; there is almost no opting out of that. So even the most basic kernel call on the GPU is in fact already a loop, since it will always get scheduled to execute many times in parallel, each time with a different invocation index. This sets up a different context for the whole “do I need vectorization” discussion. Of course vectorization is a very easy way to profit from this execution model, but you might as well write a GPU function that uses loops and is fast — as long as you can run that function a few hundred times in parallel!
So you could indeed write a loop over a large array in parallel, have smaller sum loops inside the big loop, and it’s fast.
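A CPU-side sketch of that pattern (names are illustrative, not an existing API): each simulated parallel invocation serially sums one row of a matrix, with the outer loop standing in for the GPU’s implicit parallel schedule:

```julia
# One "invocation": a small serial sum loop, run once per index i.
# On a real GPU, each i would be a separate parallel kernel invocation.
function rowsum_kernel!(out, A, i)
    acc = zero(eltype(A))
    for j in axes(A, 2)      # small serial loop inside one invocation
        acc += A[i, j]
    end
    out[i] = acc
    return
end

A = ones(Float64, 4, 8)
out = zeros(Float64, 4)
for i in axes(A, 1)          # the implicit parallel loop, here run serially
    rowsum_kernel!(out, A, i)
end
out                          # each entry is the sum of one row: 8.0
```

Note that each invocation writes only to its own slot `out[i]`, so the invocations are safe to run in any order, in parallel.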

Also how do i invoke the GPU’s sorting functions to sort a large vector?

Implement it, or wrap a library that already does it. Radix sort would be a nice start — there should be plenty of open-source GPU kernels one could build upon!
Contributions are more than welcome
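To sketch what such an implementation involves, here is a minimal CPU-side LSD radix sort on UInt32 keys (the function name and structure are illustrative, not an existing library API); a GPU port would replace the per-pass histogram and scatter with parallel histograms and prefix sums:

```julia
# LSD radix sort: four stable passes, one byte per pass, least significant first.
function radix_sort(v::Vector{UInt32})
    a, b = copy(v), similar(v)
    for shift in (0, 8, 16, 24)
        counts = zeros(Int, 256)
        for x in a                              # histogram of the current byte
            counts[((x >> shift) & 0xff) + 1] += 1
        end
        offsets = cumsum(counts) .- counts      # exclusive prefix sum per bucket
        for x in a                              # stable scatter by byte value
            bucket = ((x >> shift) & 0xff) + 1
            offsets[bucket] += 1
            b[offsets[bucket]] = x
        end
        a, b = b, a                             # ping-pong buffers between passes
    end
    return a
end
```

The histogram and prefix-sum steps are exactly the pieces that parallelize well, which is why radix sort is a popular GPU sorting algorithm.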

Let’s try to figure out what this is doing! In simple terms, this will call the Julia function kernel length(A) times in parallel on the GPU. Each parallel invocation of kernel has a thread index, which we can use to safely index into the arrays A and B. If we calculated our own indices instead of using linear_index, we’d need to make sure that we don’t have multiple threads reading and writing to the same array locations. So, if we wrote this in pure Julia with threads, an equivalent version would look like this:
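A sketch of what that threaded pure-Julia equivalent might look like, assuming a map-style kernel that writes one element of A per index (the kernel body, doubling B into A, is just an example):

```julia
using Base.Threads

# Stand-in for gpu_call: run the kernel once per index, spread over CPU threads.
function threaded_call(kernel, A, args...)
    @threads for i in eachindex(A)   # each "invocation" gets its own index i
        kernel(i, args...)
    end
end

# Each invocation writes only to A[i], so no two iterations
# touch the same memory location.
kernel(i, A, B) = (A[i] = B[i] * 2; nothing)

A = zeros(10)
B = collect(1.0:10.0)
threaded_call(kernel, A, A, B)
```

Here the thread index `i` plays the role that `linear_index` plays on the GPU.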

Maybe I should summarise where I am confused. I get the feeling from the post that any function that works on isbits arrays on the CPU will also work on the GPU. I also get that when you call a function on the GPU, it calls it “multiple times in parallel”. So when I call a function with a loop in it, does it run the whole function multiple times, or does it somehow figure out that it should parallelize the loop? Or is writing a loop inside a function a bad way to program GPUs?

Any kernel you want to call needs to get scheduled N times in parallel — that’s the implicit loop you won’t get rid of. Inside the kernel you can use whatever you want, including loops, which then get run in parallel across the N invocations.

somehow figures out that it should parallelize the loop

No, you will need to remove the loop you want to parallelize and use the implicit loop via the parallel invocations.
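For example, the transformation looks like this: drop the explicit loop and express only its body, letting the parallel invocations supply the index. With GPU arrays the easiest way is broadcasting, which the post shows compiles to a single kernel (demonstrated here on CPU arrays, where the result is the same; `sqrt(B[i]) + 1` is just an example body):

```julia
# Serial version: the loop you want to parallelize.
function f_serial!(A, B)
    for i in eachindex(A)
        A[i] = sqrt(B[i]) + 1
    end
    return A
end

# GPU-friendly version: no explicit loop. Broadcasting turns the body into
# one kernel that runs once per index in parallel; with CLArrays/CuArrays
# the same line launches on the GPU.
f_parallel!(A, B) = (A .= sqrt.(B) .+ 1)

B = collect(1.0:4.0)
A1, A2 = zeros(4), zeros(4)
f_serial!(A1, B)
f_parallel!(A2, B)   # same result, but expressed without the loop
```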