February 2010 Archives

February 25, 2010

Otherwise known as "calculating a long list of numbers then adding them all up". For a GPGPU (OpenCL) simulation program at work, I needed to calculate around 160 numbers which would be averaged to produce one result for storage in a 1024x1024 element array. That's 160 numbers for each of 1024x1024 pixels, which would be a lot to store as an intermediate result for a later step of averaging on the GPU, or (heaven forbid) to be copied back to system memory.

The magic word to search for in tackling this is reduction, and there's plenty of hardcore compsci knowledge about how to make it go as fast as possible in a parallel environment. But, basically the trick is to have 160x1024x1024 threads operate in groups of 160 (one group for each of the 1024x1024 overall elements). Threads cooperating like this can share memory, and each thread writes its individual value to an array in that local memory. Then, one of the 160 threads adds up all the values and does a single write of the final average value to the global array. For the kernel to test if it's running the "chosen thread" is as simple as something like this:

if ( get_local_id(0) == 0 )

The only bit of "funny business" is that each of the 160 threads has to have finished calculating before the results can be added. That's done with this statement, which guarantees that all previous local memory writes have completed for all threads:

barrier(CLK_LOCAL_MEM_FENCE);

This is a really simple example: for one thread to do all of the averaging is a waste of resources when the reduction itself could be parallelised. In that case, one thread would (say) add up values 0-79 while another added up 80-159, then one of those threads would (after another barrier) add up the remaining two values. It's easy to see how it can be broken down more and more, and there are variations which make better use of the GPU resources, avoid memory conflicts, and so on.

So, if you'd ever heard of the thread groups and local memory used in OpenCL (also CUDA) and wondered what they were good for, now you know..

NVidia's OpenCL Programming Guide has a lot of discussion of this topic, and there's loads more to be found around the web.

February 10, 2010

My new A7+ Freerunner arrived a few days ago. Before long I'd flashed it with the latest version of SHR-testing, made sure stuff worked, and moved my main SIM over to it. Then I set about installing all the most cutting-edge unstable userspace and 2.6.32 kernel stuff on the old A6 Freerunner ready to have some fun. But I quickly noticed that something wasn't right - it wouldn't charge properly. During my first Sunday hacking session, it didn't seem to be able to charge from a USB connection to my computer, and seemed to be getting quite warm to the touch as well.

The current_now sysfs node indicated that the device was using many hundreds of milliamps more current than it should have been. So much, in fact, that the current provided to it over USB by my laptop wasn't enough to charge the battery - with barely enough power to keep the thing running the battery slowly discharged. With the higher current provided by the mains charger, the battery would charge, but more slowly that normal. But this only seemed to happen with the 2.6.32 kernel.

Lots of discussion on IRC (mostly with DocScrutinizer) followed, and it emerged that one possibility was a short circuit in the uSD card slot. Sure enough: I'd moved my uSD card from the A6 to the A7+ Freerunner (since the new one didn't come with a card), leaving the A6's SD slot empty. On my device, some of the pins protruded up, perhaps enough to short out against the metal SD slot lid:

I cut a roughly 1x1.5cm rectangular piece of plastic from the packaging of some small halogen bulbs I'd bought the previous day, and put it into the slot as a "dummy". Testing again: no current drain. Dummy removed, the drain came back, and so on. The Doc had hit the nail on the head - the culprit was found!

But there was one more piece to the puzzle. Why didn't the drain seem to happen with the older 2.6.29 kernel? Turns out that a small detail had been left out from the kernel during an earlier merge. Because of this, the Glamo SD engine's power supply was left on all the time, when normally it should get switched off when no SD card was present. The resulting waste of a small amount of power became a huge power drain when the short circuit was added. Hopefully the SD electronics aren't too fried..