It turns out my kernels couldn't run on the GPU because the COEFFS array was declared non-static and non-const (the code above is now corrected). Once I made the array 'const static', the errors from Adreno went away, and I can now see that my GPU is loaded at 50%. It is currently giving me around 65 fps on 1920x1080 video. Great success!
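For anyone hitting the same thing, here is a minimal sketch of the declaration change in plain C. It assumes a C-like kernel language. The coefficient values, NUM_COEFFS and apply_coeffs are made up for illustration and are not taken from the actual code.

```c
#include <stddef.h>

/* Hypothetical table size and values, for illustration only. */
#define NUM_COEFFS 3

/* Before (the kind of declaration that kept the kernels off the GPU):
 *   float COEFFS[NUM_COEFFS] = {0.25f, 0.5f, 0.25f};
 * After: a read-only table that is private to this file. */
static const float COEFFS[NUM_COEFFS] = {0.25f, 0.5f, 0.25f};

/* Hypothetical helper: weighted sum of NUM_COEFFS neighbouring samples. */
static float apply_coeffs(const float *samples)
{
    float acc = 0.0f;
    for (size_t i = 0; i < NUM_COEFFS; ++i) {
        acc += COEFFS[i] * samples[i];
    }
    return acc;
}
```

The only part that matters is the 'static const' qualifier on the table; the helper is there just to show the array being read and never written.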

Just for reference, the Moto X (first gen) still gives me 16 fps with its dual-core CPU loaded at 100%.