I think the first program should be much faster than the second one, since it's more cache-friendly. However, in reality the second one is a little faster: on my server the first takes 23s while the second takes 20s. Can someone explain this?

You're not seeing a cache-friendliness advantage because the access pattern is still far too simple, even in the version you predict to be slower.

Two (or more) concurrent streams of straight-line sequential access are something a modern CPU's hardware prefetcher can detect and stream into L1 ahead of the data being needed.

Multiple streams also allow several SDRAM banks to be put to useful work at the same time. On Linux you don't get much control over that, because physical pages are mapped essentially at random (I think; is this still true?), but you can try allocating memory with mmap() and the MAP_HUGETLB flag and then experiment with different offsets from the start of the allocation.

If you want to see the advantage of arranging your computations in a cache-friendly order, experiment with different access patterns over two-dimensional arrays instead.