The trickiest change to understand is changing Y_SIZE from 30 to 32. This triggers an optimization in most C compilers that converts multiplications by a power of 2 (2, 4, 8, ...) into shifts. That pays off when the computer is doing pointer arithmetic to compute the memory address of matrix[x][y]: the compiler replaces one multiplication with one shift operation, which is cheaper.

It looks like there is nothing more to optimize. You can always write some assembly, but that may not be a good idea. The library function memset() can be used to fill a matrix: "Frequently used library subroutines like memset are often coded into assembly language and may make use of special processor-dependent tricks to do the job faster than could be done in C".

The improvements look good, but how effective is each optimization? I measured them in clock cycles, and found that the gains are processor dependent.

For clock cycles, lower is better.

Results for: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz

              clock cycles    times faster
  matrix1()   11102.151556    1
  matrix2()   6400.36597      1.7346119906
  matrix3()   6379.460394     1.740296337
  matrix4()   5952.497506     1.8651249404
  matrix5()   2154.262528     5.153574094
  matrix6()   1907.350431     5.8207193474
  matrix7()   792.123493      14.0156827239
  matrix8()   780.254779      14.2288799182

Results for: Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz

              clock cycles    times faster
  matrix1()   17175.114362    1
  matrix2()   8153.467501     2.1064797719
  matrix3()   8063.182452     2.1300664427
  matrix4()   8497.82453      2.0211189701
  matrix5()   4300.083046     3.9941355035
  matrix6()   4321.695819     3.9741608575
  matrix7()   1569.097383     10.945856228
  matrix8()   1560.792718     11.0040969335

Results for: AMD Athlon(tm) 7750 Dual-Core Processor @ 2.7GHz

              clock cycles    times faster
  matrix1()   25319.969906    1
  matrix2()   10329.498185    2.4512294259
  matrix3()   8558.934585     2.9583086136
  matrix4()   9480.851235     2.6706430972
  matrix5()   5544.608885     4.5665926003
  matrix6()   5577.454075     4.5397002943
  matrix7()   643.046753      39.3750062307
  matrix8()   631.545791      40.0920570873

So, it is real! On Intel you can get a 2x speedup through simple changes, without touching pointer arithmetic. If you do take control of the pointer arithmetic and sacrifice some variables, the gain can reach almost 6x. With ultra-specialized subroutines it climbs to 14x. That is much better than I was expecting.

For AMD, the specialized functions yield a speedup of more than 40 times.

The clock cycle counts are not integers because each value shown is the mean of 256 measurements.