The use of "if ( abs(x) > 1.E15 ) cycle" may restrict optimisation, but avoiding all the temporary arrays would improve cache usage.
In general, do you find the use of SUM ( array*array ) an efficient approach? It would not be my choice, but it is worth considering.
DOT_PRODUCT is an alternative, but is not typically optimised.
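
For anyone following along, here is a minimal sketch of the three forms being compared. The array x, its size n and the test data are my own assumptions, since the original arrays are not shown in this thread, and the 1.E15 guard is simply copied from the line quoted above. It only checks that the three forms agree; it says nothing about which is faster.

```fortran
program sum_of_squares
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000          ! assumed size, not from the thread
  real(dp) :: x(n), s_sum, s_dot, s_loop
  integer :: i

  ! Hypothetical test data; none of the values are large enough to trip the guard.
  do i = 1, n
    x(i) = real(i, dp) / real(n, dp)
  end do

  ! Form 1: array expression; the compiler may build a temporary for x*x.
  s_sum = sum(x * x)

  ! Form 2: the DOT_PRODUCT intrinsic.
  s_dot = dot_product(x, x)

  ! Form 3: explicit loop with the guard, no temporary array needed.
  s_loop = 0.0_dp
  do i = 1, n
    if (abs(x(i)) > 1.0e15_dp) cycle
    s_loop = s_loop + x(i) * x(i)
  end do

  print *, 'sum(x*x)         = ', s_sum
  print *, 'dot_product(x,x) = ', s_dot
  print *, 'explicit loop    = ', s_loop

  ! The summation order may differ between forms, so compare with a tolerance.
  if (abs(s_sum - s_dot)  > 1.0e-9_dp * abs(s_sum)) stop 'sum and dot_product differ'
  if (abs(s_sum - s_loop) > 1.0e-9_dp * abs(s_sum)) stop 'sum and loop differ'
end program sum_of_squares
```

Whether form 1 really allocates a temporary depends on the compiler and its options, so treat that comment as a possibility rather than a fact.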

Interestingly, I further modified the program to run a variable number of "kk" loops and found that, for 5 loops, "dot product" was slightly faster, but for 100 loops, "sum" was faster (which probably means they are actually identical for all practical purposes!).
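
A rough sketch of that kind of timing test is below. This is not the program I actually ran: the array size, the data and the names (nloops, acc, counts) are made up for illustration, and only the loop counts of 5 and 100 come from the description above. Timings at this scale are noisy, so small differences should not be over-interpreted.

```fortran
program time_sum_vs_dot
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 100000        ! assumed array size
  integer :: counts(2), nloops, kk, ic, i
  integer :: t0, t1, rate
  real(dp) :: x(n), acc, t_sum, t_dot

  counts = (/ 5, 100 /)                   ! the two "kk" loop counts mentioned above

  ! Hypothetical test data.
  do i = 1, n
    x(i) = real(i, dp) / real(n, dp)
  end do

  call system_clock(count_rate=rate)
  if (rate <= 0) stop 'no system clock available'

  do ic = 1, 2
    nloops = counts(ic)

    ! Time SUM(x*x); acc is accumulated and printed so the work is not discarded.
    acc = 0.0_dp
    call system_clock(t0)
    do kk = 1, nloops
      acc = acc + sum(x * x)
    end do
    call system_clock(t1)
    t_sum = real(t1 - t0, dp) / real(rate, dp)
    print *, 'loops =', nloops, ' sum         :', t_sum, ' s, acc =', acc

    ! Time DOT_PRODUCT(x, x) over the same number of loops.
    acc = 0.0_dp
    call system_clock(t0)
    do kk = 1, nloops
      acc = acc + dot_product(x, x)
    end do
    call system_clock(t1)
    t_dot = real(t1 - t0, dp) / real(rate, dp)
    print *, 'loops =', nloops, ' dot_product :', t_dot, ' s, acc =', acc
  end do
end program time_sum_vs_dot
```

An optimising compiler may hoist the loop-invariant expression out of the kk loop, in which case both timings collapse to nearly zero; changing x slightly on each pass is one way around that.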

Please note that the initial post of this thread stated that a floating point stack overflow occurred. That is, the generated code probably attempted an FLD x87 instruction at an instant when the registers ST0 to ST7 were already occupied. The compiler should have kept track of how many x87 registers were used up, and should have used the CPU stack space as scratchpad space if necessary to relieve the pressure on the FPU register stack.

I have seen other instances where 32-bit FTN95 had similar problems. Changing the CPU stack size (by using linker options, etc.) will have no effect on this problem, so it is important not to confuse FPU stack overflow, which applies only to the x87 floating-point unit of the processor and its eight-register stack, with the more common stack overflow in main memory.