Abstract—Single-threaded applications cannot match the scalability of parallel systems, mainly because of the clock-frequency limits of modern CPUs and for economic reasons. Using shared-memory or distributed-memory architectures, parallel systems can provide substantial speedup over single-threaded systems. CUDA, NVIDIA's parallel computing platform and programming model for GPUs, follows a shared-memory model and is considered powerful because it simplifies thread management. Gaussian blurring is a well-known image processing technique that reduces image noise and detail. Because the technique is computationally demanding, single-threaded implementations perform poorly. In this paper we show that speedups of orders of magnitude can be achieved by carrying out this operation on the CUDA architecture, exploiting its high degree of parallelism.