There are a few issues here. First of all, you are being very unfair to the GPU: you are forcing it to do the 2000000 calculations 128 times in parallel, while the CPU makes only one pass through the 2000000. You could instead split up your 2000000 calculations and launch far more threads in parallel.
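
To illustrate splitting the work, here is a rough sketch rather than your actual code. The kernel name Scale, its body, and the 128 x 256 launch size are all made up for the example. It assumes the usual CUDAfy.NET pattern (a [Cudafy] kernel taking a GThread, launched by name), with each thread striding over its own share of the elements so that the whole grid covers the 2000000 items exactly once:

```csharp
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public static class SplitWorkSketch
{
    // Hypothetical kernel: every thread starts at its global index and
    // strides by the total thread count, so the grid as a whole visits
    // each of the n elements exactly once.
    [Cudafy]
    public static void Scale(GThread thread, float[] data, int n)
    {
        int i = thread.threadIdx.x + thread.blockIdx.x * thread.blockDim.x;
        int stride = thread.blockDim.x * thread.gridDim.x;
        while (i < n)
        {
            data[i] = data[i] * 2.0f;
            i += stride;
        }
    }

    public static void Main()
    {
        const int n = 2000000; // total number of calculations, as in the question

        CudafyModule km = CudafyTranslator.Cudafy();
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
        gpu.LoadModule(km);

        float[] host = new float[n];
        float[] dev = gpu.Allocate<float>(n);
        gpu.CopyToDevice(host, dev);

        // 128 blocks of 256 threads: an illustrative choice, not a tuned one.
        gpu.Launch(128, 256, "Scale", dev, n);

        gpu.CopyFromDevice(dev, host); // copying back also waits for the kernel
        gpu.Free(dev);
    }
}
```

With this layout the GPU does the same total amount of work as the single CPU pass, just spread over many threads, which makes for a much fairer comparison.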

Secondly, be very careful when timing device code. Preferably use the built-in timing and synchronization functions of the GPGPU class. Invoke also has unpredictable and significant overhead. If you change the Launch args to (1, 1, ....) and add

gpu.Synchronize();

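after the Launch, you'll see that the Launch is basically asynchronous: the call returns before the kernel has finished, so a timer stopped right after Launch without a Synchronize does not measure the kernel's run time.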
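
A minimal sketch of timing a launch this way, assuming the GPGPU class's StartTimer/StopTimer pair behaves as I remember (StopTimer returning elapsed milliseconds as a float). The kernel Busy and the class name are hypothetical stand-ins for your own code:

```csharp
using System;
using Cudafy;
using Cudafy.Host;
using Cudafy.Translator;

public static class TimingSketch
{
    // Hypothetical stand-in kernel: a single thread walks all n elements.
    [Cudafy]
    public static void Busy(GThread thread, float[] data, int n)
    {
        for (int i = 0; i < n; i++)
        {
            data[i] = data[i] + 1.0f;
        }
    }

    public static void Main()
    {
        const int n = 2000000;

        CudafyModule km = CudafyTranslator.Cudafy();
        GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
        gpu.LoadModule(km);

        // Device buffer contents are left uninitialised; only the timing matters here.
        float[] dev = gpu.Allocate<float>(n);

        gpu.StartTimer();                 // assumed built-in timer of the GPGPU class
        gpu.Launch(1, 1, "Busy", dev, n); // returns almost immediately
        gpu.Synchronize();                // block until the kernel has actually finished
        float ms = gpu.StopTimer();       // assumed to return elapsed milliseconds

        Console.WriteLine("Kernel time: " + ms + " ms");
        gpu.Free(dev);
    }
}
```

If you comment out the Synchronize line, the reported time should drop sharply, because you are then measuring little more than the cost of queuing the launch.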