This question came from our site for professional and enthusiast programmers. Votes, comments, and answers are locked due to the question being closed here, but it may be eligible for editing and reopening on the site where it originated.


What did you say? – Oli Charlesworth, Mar 31 '11 at 11:08

The question is stupid. The only reasonable answer is "profile and see what is faster". – BЈовић, Mar 31 '11 at 11:21

Case x is faster because 1) profiling showed as much, 2) profiling revealed as much, and 3) profiling barfed on case y. Alternative answer: depends on system cache size, code size of the functions, what the functions do (what data they access, which could make the previous point moot), etc. – rubenvb, Mar 31 '11 at 11:25

@VJo: I disagree. There are important differences between Case 1 and Case 2. In some cases, the compiler may be able to transform one into the other (I don't know), but fundamentally there are some important lessons here, independent of whether there's one right answer. I believe this is a good interview question, because it should provoke an interesting discussion about CPU architecture, locality, optimization, etc. – Oli Charlesworth, Mar 31 '11 at 11:26

@VJo: without understanding the reasons why each version might be faster, how would you even know that it's worth trying both versions? – Mike Seymour, Mar 31 '11 at 11:30

I think this is the first answer that makes sense in the given context. – Aston, Mar 31 '11 at 14:10

The more live variables an algorithm requires, the more hardware registers the compiler is likely to use to hold them. This can lead to using memory to save temporary data (spilling), which may slow the execution of the code (cache misses, dependencies, etc.). The hardware has limited resources (e.g. registers): the more complex the algorithm, the more pressure there is on allocating those resources. – Laurent G, Mar 31 '11 at 19:19

Why branch prediction? That should be the same in both cases. – James Kanze, Mar 31 '11 at 11:19

@James: If A, B and C each have many branches, then constantly switching between them (as in Case 1) may constantly invalidate the prediction table. Unless, of course, the compiler is able to transform Case 1 into Case 2. – Oli Charlesworth, Mar 31 '11 at 11:21

@VectoR: if the three blocks between them contain more code, or use more data, than will fit in the CPU cache, then in the first case, by the time C has finished, some of the code/data needed by A will need to be fetched again from a slower cache, or even from main memory (which is much slower than the caches). In the second case, each block has a better chance of keeping all its code/data in the cache. – Mike Seymour, Mar 31 '11 at 11:27

In addition to the reasons already mentioned (locality favoring case 2, loop overhead favoring case 1), it's possible that the compiler could optimize them differently: case 1 gives it the possibility of interleaving instructions from the three blocks, possibly avoiding pipeline stalls if there are dependencies within the instructions of any one block.

Without knowing what A, B, and C are, it's impossible to judge the outcome, because it's easy to produce operations that are faster in either case. For example, suppose A is a string length operation and B is a string append operation on that string. In this case, doing all the lengths first will be faster than interleaving them with appends. However, if A is the append and B is the length, then you will want them interleaved rather than appending everything first and then computing the lengths, assuming you have a C-style O(n) length function.

Of course they share resources. They share a runtime, a process, the whole execution environment. The absence of a resource explicitly expressed as `int a;` does not mean no resources are shared.

I believe the point is to speculate on what A, B, or C could be to cause speed differences. Objecting that you don't know what they are is rejecting the (oddly worded) question. That said, I like your answer. – Kate Gregory, Mar 31 '11 at 12:58

@DumbCoder: In my opinion, it is. It propagates the notion that premature optimisation is a good idea, and penalises those who just write code, profile it, enhance it and move on. – Lightness Races in Orbit, Mar 31 '11 at 13:24

If any of the blocks have premature loop-breaks or returns, the two variants are not semantically equivalent. – James Kanze, Mar 31 '11 at 14:31

@James Kanze: I did not see anything in the original question that implied the two were semantically equivalent. That's the whole issue here: what causes these two variants to be semantically different enough to allow one to be faster than the other? – oosterwal, Mar 31 '11 at 14:35

In addition, here is another reason why case 2 could be faster than case 1: suppose the loop body sets N to 0 on its first pass (there will still be no resource sharing). This will execute only one A in the second case, but all of A, B and C in the first case.

If A, B and C are very short/quick, then the overhead of setting up the loop three times may dominate, making case 1 marginally faster.

If one of the blocks relies on an external source for data or signals (e.g. it contains a WaitForData() call), then doing the other two while you wait for that resource to become available will mean case 1 is faster. Note that this isn't an inter-code-block dependency.

The compiler may be able to find optimisations which are only implementable for one of the cases. It's impossible to say which would be faster without seeing the blocks though.

This might be a bit dependent upon the targeted operating environment, but if you are working with a multi-core processor, then the A, B, and C blocks could be run on separate cores, and we could extend that to assuming the three loops in case 2 could also be optimized to run on separate cores before joining back into the main thread. This goes along with what Oli Charlesworth said in regards to branch prediction.

Another reason case 1 may be faster is Common Subexpression Elimination (CSE):

As an extreme example, suppose A, B, and C all contain calls to some expensive but pure function f(i) that the compiler recognizes as pure (perhaps it's defined in the same compilation unit, perhaps its prototype is annotated). The compiler could transform all three calls in case 1 to one call, but in case 2 would have to repeat the calls (unless it had some array available to stick the computed results in(!)).

Just to throw out another possibility that I haven't seen anyone bring up...

In neither case do we see the type of i. We probably assume it's an int, but if i is an object, then i++ will call i.operator++(), which could have enough overhead to make case 1 faster than case 2.

If A, B, C and i each use a unique set of registers, the overhead becomes the loop itself.

If A, B and C are small pieces of code that can be cached, again the overhead becomes the loop itself.

If N is a function, there is overhead in calling it three times as often in case 2; this could be where N actually introduces a delay.

The sections could be optimized away entirely, unless N is a function or is declared volatile.

Case 2 could be faster if

A, B and C are sufficiently complicated that they cannot be made to use different sets of registers, in which case case 1 would carry the overhead of having to reload those registers.

The question has been carefully worded to allow this. It says, very specifically, that the only resource shared between the two blocks is the iterator. That means that N is not shared between them, and could be a different value in each block.