It appears that much more code (~50%) is being generated for this simple loop by the Intel C++ compiler, and I believe this is what is causing the poor relative performance. (I examined and compared the code generated for the various functions used in this snippet, i.e., the iterator dereference, increment, not-equals and constructor, and the Intel-generated code looked okay; if anything, it is slightly more concise than the MSVC++ code.)

I would appreciate it very much if someone could explain why the Intel-generated code is so much bulkier, and what I can do about it. Note that this is a follow-on from a previous thread, http://software.intel.com/en-us/forums/showthread.php?t=106290, and as stated there, I want to retain the iterator interface.

Without a compilable sample and a specification of which compilers you consider to be "the" compilers (e.g. 32- vs. 64-bit mode, /arch specification, ...), it's impossible to give anything like a complete answer.

Intel compilers tend to be more strongly oriented toward countable loops, yet they lack the machinery to convert a trivial example such as this into a branch between a countable case (presumably the normal one) and the ill-formed case (e.g. end - begin < 0).

Trivial translation, discarding the ugly case:

for(ptrdiff_t count = end - begin; count > 0; --count){.....}

For some, the tradition of handling the ugly case is stronger in 32-bit mode (the ugly 64-bit case will hang whether it is done "correctly" or not).

For reasons of practicality, it was necessary to take a sane interpretation of some of the STL linear iterators, which are vectorizable aside from the ugly case, and each compiler takes its own path there. As the VS2012 compiler has been advertised as supporting auto-vectorization but has not yet been unveiled to many of us, it may turn out to be quite different from VS2010.
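To make the countable-loop translation concrete, here is a minimal sketch. It assumes the iterator behaves like a raw pointer into contiguous storage, since the original iterator class is not shown in this thread. The function names sum_iterator_style and sum_countable are hypothetical and used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Iterator-style loop: the trip count is implicit in the != test.
// If end < begin, this loop does not terminate sensibly (the "ugly case").
inline long sum_iterator_style(const int* begin, const int* end) {
    long total = 0;
    for (const int* it = begin; it != end; ++it) {
        total += *it;
    }
    return total;
}

// Countable form: the trip count is computed once, up front.
// A non-positive count (end <= begin) simply yields zero iterations,
// i.e. the ugly case is discarded rather than handled.
inline long sum_countable(const int* begin, const int* end) {
    long total = 0;
    const int* it = begin;
    for (std::ptrdiff_t count = end - begin; count > 0; --count, ++it) {
        total += *it;
    }
    return total;
}
```

For a well-formed range the two functions return the same result. Whether the second form actually produces tighter code depends on the compiler, target and options, so it is worth comparing the generated assembly for both.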