Introduction

This article is intended to introduce software developers to the topic of optimization. To do so, it explores several different optimization techniques.

As a first step, I have chosen an easy-to-understand algorithm to which I have applied various optimization techniques:

The problem we will solve is the 3n + 1 problem (also known as the Collatz problem): for every number n between 1 and 1000000, apply the following function:

f(n) = n / 2 if n is even, 3n + 1 if n is odd

until the number becomes 1, counting the number of times we applied the function.

This algorithm will be executed for all the numbers between 1 and 1000000. No input is read from the keyboard; the program prints the result, followed by the execution time (in milliseconds) needed to compute it.
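The measurement described above can be sketched as follows. This is only an illustration, not the article's original timing code: the helper name `averageMilliseconds` and the choice of `std::chrono::steady_clock` are assumptions, and it times a callable inside one process rather than launching a separate executable for every run.

```cpp
#include <chrono>

// Hypothetical helper (not from the original article): runs `work` `runs`
// times and returns the average wall-clock time of one run in milliseconds.
template <typename Func>
double averageMilliseconds(Func&& work, int runs) {
    using Clock = std::chrono::steady_clock;  // monotonic, suited to intervals
    double totalMs = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = Clock::now();
        work();
        const auto stop = Clock::now();
        totalMs += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    // Avoid dividing by zero when no runs were requested.
    return runs > 0 ? totalMs / runs : 0.0;
}
```

A caller would pass the solver as a lambda, for example `averageMilliseconds([] { Solve(); }, 100)`, where `Solve` stands for whichever version of the algorithm is being measured.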

Prerequisite

N/A

Different implementations for the same problem

The initial version of the implementation: for each number between 1 and 1000000, the algorithm described above is applied, generating a sequence of numbers until n becomes 1. The steps needed to reach 1 are counted and the maximum number of steps is determined.
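The original listing is not reproduced in this text, so the following is a minimal sketch of what such a baseline could look like. The function names `cycleLength` and `maxCycleLength` are assumptions; the variable names follow the snippets quoted later in the article, and the cycle count starts at 1 so that the starting number itself is counted.

```cpp
// Hypothetical reconstruction of the baseline, not the author's exact code.
// Precondition: nNumberToTest >= 1 (0 or a negative value would never reach 1).
int cycleLength(long long nNumberToTest) {
    int nCurrentCycleCount = 1;  // the starting number counts as one element
    while (nNumberToTest != 1) {
        if ((nNumberToTest % 2) == 1) {
            nNumberToTest = nNumberToTest * 3 + 1;  // odd: 3n + 1
        } else {
            nNumberToTest = nNumberToTest / 2;      // even: n / 2
        }
        ++nCurrentCycleCount;
    }
    return nCurrentCycleCount;
}

// Returns the largest cycle length over the inclusive range [first, last].
int maxCycleLength(int nFirstNumber, int nSecondNumber) {
    int nMaxCycleCount = 0;
    for (int i = nFirstNumber; i <= nSecondNumber; ++i) {
        const int nCurrent = cycleLength(i);
        if (nCurrent > nMaxCycleCount) {
            nMaxCycleCount = nCurrent;
        }
    }
    return nMaxCycleCount;
}
```

Calling `maxCycleLength(1, 1000000)` corresponds to the run measured in this article. The intermediate values exceed 32 bits for some starting numbers, which is why the working variable is a `long long`.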

I compiled the code as both Debug and Release builds, in both 32-bit and 64-bit versions. I then ran every executable 100 times and computed the average time (in ms) the calculation takes.

Here are the results:

| Average time (ms) | C++ Debug | C++ Release | C# Debug | C# Release |
| --- | --- | --- | --- | --- |
| x86 version | 6882.91 | 6374.50 | 6358.41 | 5109.90 |
| x64 version | 1020.78 | 812.71 | 1890.36 | 742.28 |

The first thing to observe in the table is that the 32-bit versions are roughly 3 to 8 times slower than the 64-bit versions. This is because on x64 a single register can hold a long long variable, whereas on x86 two registers are needed, so operations on long long values are slow on x86. For this reason we will not examine the 32-bit builds any further in this article.

The second thing to notice is the difference between the Release and Debug builds, which is larger for C# than for C++.

Another observation is the difference between the C# Release version and the C++ Release version. This, together with the previous observation, makes me believe that the C# compiler optimizes better than the C++ compiler (maybe even employing some of the optimization techniques we are going to talk about later).

The first optimizations I will apply are related to performing the mathematical operations faster by replacing the conventional way of doing them with an unconventional one.

If we look at the code, we see only three relatively expensive mathematical operations: modulo 2 (%), multiplication by 3 (*) and division by 2 (/).

The first operation I will optimize is the modulo 2. All numbers are represented in memory as a sequence of bits. The representation of an odd number always has its last bit set to 1 (5 = 101, 13 = 1101, etc.) and the representation of an even number always has its last bit set to 0 (6 = 110, 22 = 10110). So if we extract the last bit of a number and test it against 0, we know whether the number is odd or even. To get the last bit of a number I use the bitwise AND operator (&).

In C++, replace:

if ((nNumberToTest % 2) == 1)

with:

if ((nNumberToTest & 0x1) == 1)

In C#, replace:

if ((iNumberToTest % 2) == 1)

with:

if ((iNumberToTest & 0x1) == 1)
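As a quick sanity check of this replacement, the two tests can be compared side by side. This is a sketch only: the function names `isOddModulo` and `isOddMask` are hypothetical, and unsigned operands are assumed. The algorithm only ever sees positive values, so the difference for negative signed values (where `-3 % 2` is `-1` in C++11 and later, so the `== 1` comparison fails) does not arise here.

```cpp
// Hypothetical helpers for illustration; both assume a non-negative value.

// Conventional parity test using the remainder of a division by 2.
bool isOddModulo(unsigned long long n) {
    return (n % 2) == 1;
}

// Bitwise parity test: keeps only the least significant bit.
bool isOddMask(unsigned long long n) {
    return (n & 0x1) == 1;
}
```

For unsigned values the two functions always agree, which is why an optimizing compiler is generally free to emit the same instruction for both.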

Here are the results:

| C++ Debug | C++ Release | C# Debug | C# Release |
| --- | --- | --- | --- |
| 922.46 | 560.86 | 1641.41 | 714.10 |

The C++ Release version benefits most from this optimization. The difference in improvement between the C++ Release and Debug versions leads me to believe that the compiler is able to remove more instructions in the Release build once the bitwise test is used.

C# does not seem to benefit much from this optimization.

The next operation I will try to optimize is the division by 2. If we look again at the binary representation of the numbers, we can observe that dividing by 2 discards the last bit of the number and adds a 0 bit in front of the remaining bits. So 5 (= 101) / 2 = 2 (= 010), 13 (= 1101) / 2 = 6 (= 0110), 6 (= 110) / 2 = 3 (= 011), etc. I will replace this operation with the bitwise right shift, which produces the same result.

In C++, replace:

nNumberToTest = nNumberToTest / 2;

with:

nNumberToTest = nNumberToTest >> 1;

In C#, replace:

iNumberToTest = iNumberToTest / 2;

with:

iNumberToTest = iNumberToTest >> 1;
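The equivalence can be checked with a short sketch. The function names `halveDivide` and `halveShift` are hypothetical, and the operands are assumed to be unsigned: for negative signed values, `/ 2` truncates toward zero while `>> 1` may round the other way (and was implementation-defined before C++20), so the two are not interchangeable there.

```cpp
// Hypothetical helpers for illustration; unsigned operands are assumed.

// Conventional halving with integer division.
unsigned long long halveDivide(unsigned long long n) {
    return n / 2;
}

// Halving by shifting every bit one position to the right;
// the least significant bit is discarded.
unsigned long long halveShift(unsigned long long n) {
    return n >> 1;
}
```

The examples from the text hold for both versions: 5 gives 2, 13 gives 6 and 6 gives 3.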

Here are the results:

| C++ Debug | C++ Release | C# Debug | C# Release |
| --- | --- | --- | --- |
| 821.58 | 555.96 | 1432.01 | 652.11 |

The C++ Debug, C# Debug and C# Release versions gain roughly 60 to 210 milliseconds from this optimization.

C++ Release gains almost nothing from this replacement, probably because the compiler was already performing this optimization.

The last mathematical operation that consumes time is the multiplication by 3. The only thing we can do with this operation is to replace it with additions.

In C++ replace:

nNumberToTest = nNumberToTest * 3 + 1;

with:

nNumberToTest = nNumberToTest + nNumberToTest + nNumberToTest + 1;

In C# replace:

iNumberToTest = iNumberToTest * 3 + 1;

with:

iNumberToTest = iNumberToTest + iNumberToTest + iNumberToTest + 1;
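The variants of the 3n + 1 step can be compared in a small sketch. The function names are hypothetical, unsigned operands are assumed, and the shift-and-add form is the one suggested in the comments below rather than something measured in this article. Whether any of them is actually faster depends on the compiler and the build settings.

```cpp
// Hypothetical helpers for illustration; all three compute 3n + 1.

// Conventional form using a multiplication.
unsigned long long nextOddMultiply(unsigned long long n) {
    return n * 3 + 1;
}

// Multiplication replaced by repeated addition.
unsigned long long nextOddAdd(unsigned long long n) {
    return n + n + n + 1;
}

// Multiplication replaced by a left shift (2n) plus one addition (n).
unsigned long long nextOddShift(unsigned long long n) {
    return (n << 1) + n + 1;
}
```

All three return 52 for an input of 17, the next element of the sequence after 17.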

Here are the results:

| C++ Debug | C++ Release | C# Debug | C# Release |
| --- | --- | --- | --- |
| 820.84 | 548.93 | 1535.28 | 629.89 |

The biggest performance gain can be observed in the C# Release version, followed by the C++ Release version.

The C# Debug version is slower than before because the current version executes more instructions than the previous one and the compiler cannot optimize them (it cannot replace them with anything else because we might need to set a breakpoint on any of them).

There is one last mathematical optimization we can perform, based on some special instructions that the processor implements: the so-called conditional move instructions. To get the compiler to generate a conditional move instruction, I will replace the IF statement (which checks whether the number is odd or even) with the ternary operator (?:).

To be able to implement the optimization mentioned above, we need to restate the problem. If the number is even, it is divided by 2 (as the problem requires). If the number n is odd, it can be expressed as 2 * k + 1. Applying this to the initial form of the function we obtain:

3 * (2 * k + 1) + 1 = 6 * k + 4 = 2 * (3 * k + 2), where 3 * k + 2 = (2 * k + 1) + k + 1, that is, n + n / 2 + 1 with integer division.

From the above equation we can see that, for an odd number, 3n + 1 is always even, so two steps of the algorithm can be performed in one. We will rewrite the algorithm so that we first compute the next value of the number to test as if the current value were even. Then we save the value of the last bit of the current number to test. If this bit is set, we increment the current cycle count one extra time and add the current number + 1 to the next value of the number to test. (Note: this optimization will become really important in one of the next articles, when I will talk about SSE.)
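The rewritten loop can be sketched as below. This is a minimal illustration under stated assumptions, not the author's exact listing: the function name `cycleLengthCombined` is hypothetical, the number is held in an unsigned 64-bit variable, and nothing guarantees that a given compiler turns the ternary operators into conditional move instructions. It relies on the fact that, for odd n, two steps take n to (3n + 1) / 2 = n + n / 2 + 1 with integer division.

```cpp
// Hypothetical sketch of the combined-step loop described above.
// Precondition: nNumberToTest >= 1 (0 would never reach 1).
int cycleLengthCombined(unsigned long long nNumberToTest) {
    int nCurrentCycleCount = 1;  // the starting number counts as one element
    while (nNumberToTest != 1) {
        // Last bit of the current number: 1 if odd, 0 if even.
        const unsigned long long nOddBit = nNumberToTest & 0x1;
        // Next value as if the number were even: n / 2.
        unsigned long long nTempNumber = nNumberToTest >> 1;
        // If odd, add n + 1 so the result is n + n / 2 + 1 = (3n + 1) / 2.
        nTempNumber += nOddBit ? nNumberToTest + 1 : 0;
        // An odd number consumed two steps of the sequence, an even one a single step.
        nCurrentCycleCount += nOddBit ? 2 : 1;
        nNumberToTest = nTempNumber;
    }
    return nCurrentCycleCount;
}
```

For example, starting from 3 the loop visits 5, 8, 4, 2 and 1 while counting 8 elements, the same count the step-by-step sequence 3, 10, 5, 16, 8, 4, 2, 1 produces.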

Both Debug builds show a slowdown, because we are now executing more instructions than in the previous versions of the code and the compilers cannot optimize them.

The C# Release version also shows a slowdown, which I attribute to the C# compiler not generating conditional move instructions.

The power of this category of instructions is shown by the increased speed of the C++ Release version.

You may have noticed that I did not solve the problem using recursion. For this problem, a recursive algorithm would be extremely slow: the maximum cycle length is 525, so assuming that most of the numbers have a cycle length of around 150 (just a guess, not actually verified), 150 recursive calls for every number between 1 and 1000000 would add up to 150000000 calls. This is clearly not a small number and, because calling a function takes time, recursion is definitely not a good solution for this problem.

Points of Interest

It's time to draw the conclusions:

Modulo and division operations take a lot of time and, where possible, should be replaced by something cheaper.

Try to analyze the problem and obtain an alternate representation of it.

Try to eliminate IF statements from your code when their only purpose is to set some values based on a condition.

The next article will be about how to make our program faster using threading in C# and C++.

History

27 May 2012 - Initial release.

28 May 2012 - I would like to thank anlarke for pointing out things that could be improved in the article and for submitting his code (C++ Debug time: 546.76 ms, C++ Release time: 386.35 ms). Also I would like to thank Reonekot for his clarification on the WoW topic. He is right and the performance problems are caused by the fact that the registers are 32 bits (for x86) and 64 bits (for x64).

Comments and Discussions

Thanks for your nice article. I have found a fine article also. It is for C#. I want to share it

We should follow a standard naming convention.
Variable and method/function names should be relevant and as short as possible.
Use X++ instead of X=X+1. Both return the same result, but X++ uses fewer characters.

My vote is not about your code working or not, but about the fact that it is shown as a beginners' article.
Beginners may understand (or think they understood) your optimizations and start to use it, unnecessarily.
It is extremely rare that we have a problem where the mathematics optimization is doing any good. Generally, using a better high-level solution is better.

What I mean is: Use a hashset instead of a list. Use a dictionary and cache your results instead of recalculating them. And things like that.
I will not say that your optimizations are useless, but they are not for beginners. Beginners should learn how to use the best resources without complicating code... experts may complicate code to search the best performance.

Ok, I've got a live example on that: Consider a particle system in a game that draws rain using thousands of particles by operating 3D vectors in real time. Now there is a frame (about 60 of them per 1 second) in which all those particles are drawn. So, in every particle (consider a class) there are at least Update() and Draw() functions which operate intensively with 3D vectors. So. If you're going to optimize something like

inline Vector operator+(const Vector &v) const;

or other operators, or even something just inside those Update() and Draw() functions, you're definitely going to have significant FPS improvements. And yes, sometimes you have to leave the beauty of human-readable operations behind double slashes. PS: also use SIMD.

Then if you decide you really DO need to "improve" (AKA obfuscate) some highly-used sections, go read http://graphics.stanford.edu/~seander/bithacks.html

Note that several of the optimizations the author suggests here are actually bit-hacks, and may ONLY work on twos-complement machines. The C++ standard makes no guarantees about how the bits are laid-out in any specific machine (except that >>1 and /2 are guaranteed to have the same effect for unsigned ints).

Finally, if you want some well-written general advice for optimizing, from someone who knows what they're talking about, go start looking at http://developer.amd.com/documentation/articles/pages/6212004126.aspx and http://developer.amd.com/documentation/articles/pages/7162004127.aspx

The primary criticism on this article seems to be that it is better to make the algorithm more efficient than try to get the most efficient little bits of math. To that end, I agree with those remarks.

In your example, you are linearly testing every number. Therefore you know that all smaller numbers have already been confirmed. You don’t need to continue once you have a number smaller than your original test number.
Also, we know that every even number will eventually shrink down to an odd number that we have already tested. (e.g. 12 / 2 = 6. 6 / 2 = 3. 3 was already tested.) So, we don’t have to test any even numbers.

Your Program_v1.cs takes 983 ms to run on my PC. After the above optimizations, the time is reduced to 34 ms. This is an improvement vastly more significant than any in either this article, or part 2 of the article.

//current maximum cycle length
public static int MaxCycleCount = 0;
//first number
private const int FirstNumber = 3; // !! we already know 1 & 2 - start at 3 for test comparison in while loop.
//second number
private const int SecondNumber = 1000000;
//function used to solve the problem
public static void Solve()
{
    //for every number between the first and the last
    for (int i = FirstNumber; i < SecondNumber; i += 2) // !! only test odd numbers
    {
        //cycle count of current number
        int iCurrentCycleCount = 1;
        //current number
        long iNumberToTest = i;
        //while the current number has not dropped below the starting number
        while (iNumberToTest >= i) // !! we are going linearly, so we know all smaller numbers are already confirmed.
        {
            //test if the number is odd
            if ((iNumberToTest % 2) == 1)
            {
                //the number is odd
                iNumberToTest = iNumberToTest * 3 + 1;
            }
            else
            {
                //the number is even
                iNumberToTest = iNumberToTest / 2;
            }
            //increment cycle count of current number
            iCurrentCycleCount++;
        }
        //if current number's cycle count is greater than the maximum cycle count, set maximum cycle count to current number's cycle count
        if (iCurrentCycleCount > MaxCycleCount)
        {
            MaxCycleCount = iCurrentCycleCount;
        }
    }
}

I agree with your observations. But your assumption that we are testing each number linearly is not valid in case you use threads.

Also, I see that your algorithm does not work correctly. You stop processing when you reach a number that has been already processed but you do not take into account the cycle number of the number you reached. To do that you need to keep track of the cycle numbers of all the numbers that you process.

However the purpose of the articles is not to find an optimized algorithm for this problem, but to provide tips on how to approach an optimization problem.

Anyway, I will update part 1 of the article(when I have time) with your observations.

Yes, my code would have to be updated to have CycleCount represent how long it takes to get to 1 instead of how long it takes to prove that it will go to 1. Since you weren’t doing much with it, I just took the simplest approach. Your suggestion of storing the values would work.
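The idea discussed in this exchange, stopping early while still reporting the full cycle length, can be sketched by caching the cycle length of every number already processed. This is an illustrative sketch only, written in C++ rather than the commenter's C#: the function name `maxCycleLengthCached` and the cache layout are assumptions, and it relies on the numbers being processed in ascending order within a single thread.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: largest cycle length over 1..limit, reusing the
// cycle lengths of smaller numbers that were already computed.
int maxCycleLengthCached(std::uint32_t limit) {
    if (limit < 1) {
        return 0;  // empty range
    }
    std::vector<int> cache(static_cast<std::size_t>(limit) + 1, 0);
    cache[1] = 1;  // the sequence starting at 1 has a single element
    int best = 1;
    for (std::uint32_t i = 2; i <= limit; ++i) {
        std::uint64_t n = i;
        int steps = 0;
        // Iterate only until the value drops below i; its length is cached.
        while (n >= i) {
            n = (n & 1) ? 3 * n + 1 : n >> 1;
            ++steps;
        }
        cache[i] = steps + cache[n];  // steps taken here plus the known remainder
        if (cache[i] > best) {
            best = cache[i];
        }
    }
    return best;
}
```

The trade-off is memory (one `int` per number in the range) and the dependence on ascending order, which is the assumption the author objects to for the threaded version.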

I think that I fulfilled your purpose of wanting to “provide tips on how to approach an optimization problem”. My point is “Optimize your algorithm. It is more important than the minutia of mathematical operators.” (Yes, optimizing mathematical operators can buy you a bit; but you are better off looking elsewhere.)

The fact that you didn't take into account the CycleCount proves that you did not read the problem statement carefully.

The fact that you suggest the 3 * n + 1 optimization proves that you did not read the article.

Other than that you did not fulfill the purpose of this article, because you provided a particular solution for this problem, based on a fact that you assume that is always true(the fact that we process the numbers in a linear ascending sequence). If you read the article you will see that I make no such assumption, so the "random optimizations" I describe work without any problems when using threads for example.

If you are suggesting that I didn’t fully read Part 2, then you are correct. I had only skimmed part 2 for the test results. I apologize for not checking both parts before providing the suggestion.

There are still larger optimizations to the algorithm (for both efficiency and readability) even if you require checking down to 1 for every number. The code at the bottom of this post shows my attempt at this.
For efficiency, I find the least significant bit that is set in the number. This allows me to divide by 2 until an odd number is found. This is only more efficient because there are many times where we have to divide by 2 multiple times in a row. This would probably not be a practical change if you switched from ulong to BigInteger.
For readability, I used the new .NET 4 class: Parallel. It basically does everything that you did with your processor count and work queue. In my basic tests, this class seemed comparable in performance to doing it manually. (I didn’t look at this part too thoroughly.)

You assert that you don’t assume that the numbers will be in a linearly ascending sequence. Do you mean that you wanted this to work where you only check a single large number? I made the assumption of a linear sequence because that is what you were doing. Even if you had multiple threads going, you were still checking all numbers from 1 to n. (As was mentioned with saving the partial count.)

I still hold that optimizing the algorithm is much more important than optimizing individual mathematical operations. However there is also the idea that you shouldn’t optimize too early. Your original code worked, was easy to understand, and still ran fairly quickly. Unless you need it to run faster, there is no point in optimizing. You shouldn’t sacrifice maintainability for performance just because you can. If you need to make your code run faster, then do so; but if not, don’t.

I also agree with you that you should really have a speed problem before starting to optimize.

And I especially agree with the part about the trade off between readable code and faster code.

About the assumptions, I tried not to make any assumptions(at least in this part of the article)(the last version in the second part assumes that the user has the queues filled in). So, with what I presented the user of the program is free to process the numbers how he wants, for ex: run the algorithm for 1 large number or define another interval to run the algorithm on or to process the number backward, etc.

Also about the parallel library, I tried to show the lowest level implementation.

In general, on the point of libraries, my philosophy is to use a library only if you know how to implement the operations it performs. Using a library blindly, without knowing how it does its work, can introduce bugs in your application.

In general, it is futile to try to hand-optimize code because modern compiler optimizing algorithms typically do a better job than most programmers could. There are a few exceptions, of course. Integer arithmetic is one of them. Integer operations have to by definition handle the general case, so there are specific conditions like the one in the article where hand-optimization might produce faster code.

Having said that, as has already noted in the responses, obfuscating code like this is generally a very bad idea unless there is an overwhelmingly large advantage to be gained or a severe constraint to address - such as strict performance requirements. Thus, it is not for everybody or every circumstance.

A few notes on the actual work, though.

1) The ?: notation is a conditional comparison. It is just like an if...then, but in shorthand notation. Changing the form does not change the operation, so in this case, you have substituted two if...then operations in shorthand to remove one if...then operation in longhand. That's why it slowed down.

2) One optimization you have missed is that the 3N + 1 operation can be replaced in its entirety with

( N << 1 ) + N + 1

The bitwise shift operations are atomic and thus one of the fastest operations that can be performed. A left-shift is faster than an add, where it is possible to do so. In this case, the equation 3N + 1 is exactly equivalent to 2N + N + 1. left-shift is equivalent to multiplying by two, just as right-shift is equivalent to dividing by two.

Try this as an alternative. It is easier to read, and I think it will prove among the fastest.

No doubt about the article... excellent. I just want to point out one typo.
1> Your last optimization formula should be, if CurrentNumber is odd:
NextNumber = 2 * (CurrentNumber + CurrentNumber/2 + 1/2);
which makes NextNumber = 52 from CurrentNumber = 17, the correct number in the series. But you wrote
NextNumber = 2 * (currentNumber + currentNumber/2 + 1); which does not give 52 from 17...

Hi Razvan
That is right.
main_V5.cpp on my Intel Core 2 (a somewhat old box) takes around 210 to 920 ms, but after changing signed to unsigned and making a simple change to the while loop, as in the following code snippet, it takes around 100 ms less.

//first number
int nFirstNumber = 1;
//second number
int nSecondNumber = 1000000;
//maximum cycle count
int nMaxCycleCount = 0;
//Function used to solve the problem
void Solve()
{
    //for every number between the first number and the second number
    for (int i = nFirstNumber; i < nSecondNumber; ++i)
    {
        //cycle count for current number
        int nCurrentCycleCount = 1;
        unsigned long long nNumberToTest = i;
        int nOddBit = 0;
        unsigned long long nTempNumber = 0;
        //loop until the break below (replaces: while (nNumberToTest != 1))
        while (true)
        {
            //compute current number last bit
            nOddBit = nNumberToTest & 0x1;
            //divide by 2
            nTempNumber = nNumberToTest >> 1;
            //break if the halved value reached zero, i.e. the current number is 1
            if (nTempNumber == 0) break;
            //if number is odd add the current number + 1 to the result of the division by 2.
            nTempNumber += nOddBit ? nNumberToTest + 1 : 0;
            //if number is odd increment current cycle count by 2, else increment it by 1
            nCurrentCycleCount += nOddBit ? 2 : 1;
            //update the value of the current number
            nNumberToTest = nTempNumber;
        }
        //if the current number cycle count is greater than the current maximum, set the current maximum to the current number's cycle count
        if (nCurrentCycleCount > nMaxCycleCount)
        {
            nMaxCycleCount = nCurrentCycleCount;
        }
    }
}

As others have said, the compiler will do a lot of optimization for you, and other techniques, such as choosing fast algorithms, have a major impact on speed. One of the most important things for improving performance is to use a good tool to determine where the bottlenecks are located, and then optimize only those time-critical portions of the code. One of the often overlooked tools is the CLR Profiler from MS, which can show whether garbage collection is being done inefficiently and where those problems are located. These techniques will have a greater impact on speed of execution than your tweaks. That said, your ideas are good to know and a programmer should have a variety of techniques in their back pocket, so I do feel that your article has some value.

I bet you never want to maintain that code - this is crap. I would blindly dump it and rewrite in a decent way, i.e. in a way that one understands the algorithm. You might replace on an integer value the ...% 2... by something like ... & 0x01, but never without commenting the meaning.

I doubt that you would want to have this kind of code in your productive environment.