OK, this is a bit difficult to explain. First, the reason the second one works is that it is automatically unrolled. It loops a constant number of iterations, so the compiler can silently unroll it without side effects.

Now, the reason the first one doesn't work as expected is that derivatives (as used by a gradient instruction such as tex2D) are undefined within a conditional statement (such as the one implicitly used by the loop, whose exit test depends on a run-time value).

You can solve this either by not using derivatives at all (i.e. supplying a constant LOD through tex2Dlod rather than using tex2D) or by manually computing the derivatives outside the loop and supplying them explicitly through tex2Dgrad. If you only want to blur a screen-aligned rectangle, the former is the best method.
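A minimal sketch of both options, assuming a horizontal blur compiled for ps_3_0. The names SceneSampler, SampleCount and TexelSize are hypothetical and stand in for whatever your shader actually uses.

```hlsl
// Hypothetical globals, set by the application.
sampler2D SceneSampler;
int SampleCount;      // loop count only known at run time
float2 TexelSize;     // size of one texel in UV space

// Option 1: explicit LOD, so no derivatives are needed inside the loop.
float4 BlurLodPS(float2 uv : TEXCOORD0) : COLOR0
{
    float4 sum = float4(0, 0, 0, 0);
    for (int i = 0; i < SampleCount; i++)
    {
        float2 offsetUv = uv + float2(i * TexelSize.x, 0);
        // tex2Dlod reads the mip level from the w component; 0 is the top mip.
        sum += tex2Dlod(SceneSampler, float4(offsetUv, 0, 0));
    }
    return sum / max(SampleCount, 1);
}

// Option 2: derivatives computed once, outside the loop, then passed in.
float4 BlurGradPS(float2 uv : TEXCOORD0) : COLOR0
{
    // The offset is the same for every pixel, so the derivatives of the
    // offset coordinate equal the derivatives of uv itself.
    float2 dx = ddx(uv);
    float2 dy = ddy(uv);

    float4 sum = float4(0, 0, 0, 0);
    for (int i = 0; i < SampleCount; i++)
    {
        float2 offsetUv = uv + float2(i * TexelSize.x, 0);
        sum += tex2Dgrad(SceneSampler, offsetUv, dx, dy);
    }
    return sum / max(SampleCount, 1);
}
```

As far as I know, tex2Dlod in a pixel shader needs ps_3_0 and tex2Dgrad needs ps_2_a or later, so neither variant will compile for plain ps_2_0.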

The compiler needs to prove that the loop will terminate within 1024 iterations to target the ps_2_0 profile. Since your index value comes from a global, no range information can be inferred beyond the 32-bit int type. You can force this by giving the loop an upper bound that is a compile-time constant.
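One way to give the compiler that bound is sketched below. It is only an illustration under assumptions: MAX_SAMPLES, SceneSampler, SampleCount and TexelSize are hypothetical names, and the limit of 8 is an arbitrary value chosen to stay inside the ps_2_0 instruction budget.

```hlsl
// Hypothetical compile-time upper bound; keep it small for ps_2_0.
#define MAX_SAMPLES 8

// Hypothetical globals, set by the application.
sampler2D SceneSampler;
int SampleCount;      // requested number of samples, unknown at compile time
float2 TexelSize;

float4 BoundedBlurPS(float2 uv : TEXCOORD0) : COLOR0
{
    float4 sum = float4(0, 0, 0, 0);

    // The trip count is a literal constant, so the loop can be fully unrolled.
    for (int i = 0; i < MAX_SAMPLES; i++)
    {
        // The fetch happens unconditionally, so its derivatives stay defined.
        float4 texel = tex2D(SceneSampler, uv + float2(i * TexelSize.x, 0));

        // Select rather than branch: samples past SampleCount add nothing.
        sum += (i < SampleCount) ? texel : float4(0, 0, 0, 0);
    }

    // Divide by the number of samples that actually contributed.
    float count = clamp(SampleCount, 1, MAX_SAMPLES);
    return sum / count;
}
```

The trade-off is that all MAX_SAMPLES fetches are always issued, even when SampleCount is smaller.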

GPUs shade a 2x2 'quad' of 4 pixels at a time, and the gradients used for texture fetches come from finite differences calculated between adjacent pairs of pixels in that quad. This is how the GPU is able to generate partial derivatives even for arbitrary expressions in the pixel shader. The tex2Dgrad function can be useful if you can calculate more accurate analytical derivatives for the values you are passing in as texture coordinates.
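As a small illustration of analytical derivatives, assume a full-screen quad whose UVs run from 0 to 1 across the render target. In that case the derivatives are known in closed form and don't have to come from the quad differences. ViewportSize and SceneSampler are hypothetical names.

```hlsl
// Hypothetical globals, set by the application.
sampler2D SceneSampler;
float2 ViewportSize;   // render target size in pixels

float4 AnalyticGradPS(float2 uv : TEXCOORD0) : COLOR0
{
    // Assumption: uv covers 0..1 over the whole render target, so one pixel
    // step to the right changes u by 1/width and one step down changes v
    // by 1/height.
    float2 dx = float2(1.0 / ViewportSize.x, 0);
    float2 dy = float2(0, 1.0 / ViewportSize.y);

    // With dx = ddx(uv) and dy = ddy(uv) this would be roughly what plain
    // tex2D does implicitly; here the values are supplied analytically.
    return tex2Dgrad(SceneSampler, uv, dx, dy);
}
```

Because dx and dy no longer depend on neighbouring pixels, a fetch like this can also be placed inside a loop or branch.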