Coder's Revolution

In a recent pissing match between ColdFusion and PHP, Jared Rypka-Hauer was demonstrating the performance of a function that generated prime numbers. The discussion really wasn't about the BEST prime generator as much as it was about how much ColdFusion can kick PHP's puny butt all over town. Never the less, I piped up in the comments to ask Jared to compare a prime number generator that I wrote a while back based on the Sieve of Eratosthene. After Jared asked some good questions about how my code worked I figured it was time I stopped high-jacking the comments of the PHP pooper train. I decided to spin off a new post to highlight some significant performance gains I was able to produce.
Now, here comes the boilerplate:

I readily admit there are few real-life needs for this kind of function.

Analyzing stack traces for hours in order to shave a few milliseconds is purely for academia.

Yes, if milliseconds really mattered, I'd be doing this in Java, which for the record, would be very easy to accomplish with ColdFusion's rampant awesomeness.

A lot of my performance advice in this entry is pretty useless for day-to-day code. Most apps don't do things millions of times, and this level of hair splitting would be overkill.

This entry has nothing to do with PHP, but ColdFusion still kicks it's butt.

Background

If you haven't read the original post on my implementation of the Sieve of Eratosthenes this would be a splendid time to do so. If you don't really need this credit for your major and any passing grade will do, at least read the Wiki page for the algorithm.
Here's the basics: Some algorithms try find all the prime numbers outright up to a given number. The Sieve of Eratosthenes does the opposite. It generates a list of all possible candidates and then goes through and eliminates every number which is divisible by another digit other than itself and 1. When it is finished, everything left is a prime.

Starting Point

Here is the what the function looked like last time I finished fiddling with it:

[code]
<cfscript>
function getPrimes3(num)
{
var p = 1; // Counter for outer loop over array
var i = 0; // Counter inner loop finding multiples of each array index
var m = 0; // holds multiples
var candidates = []; // main array that holds values 1 through num
var primes = []; // final array that primes are copied to to return
// populate candidate array with every number between 1 and num
while (++i &lt;= num) candidates[i] = i;
// Loop over every index in our candidate array
while(++p&lt;arrayLen(candidates))
{
// Reset inner counter for each outer loop
i = 1;
// If the number at this index has already been marked as "0" (not prime) skip it.
if(!candidates[p])
{
continue;
}
// Loop from 1 to num to find all multiples of the current index
// represented by the outer-loop value of p
while (++i &lt;= num)
{
// m is the product of p and i and is therefore not prime
m = p*i;
// As long as m is in the array
if(m&lt;=num)
{
// Mark it as not prime
candidates[m] = 0;
}
else
{
// if the product of p and i were greater than num,
// then break out of the loop.
break;
}
} // end inner loop
} // end candidate array loop
// Re-use i for this final loop
i = 0;
while (++i &lt;= arrayLen(candidates))
{
// Every number left in the candidate array greater than 1
// and less than num is a prime we care about.
if(candidates[i] > 1 && candidates[i] &lt;= num)
{
// Add it into the primes array
primes[arrayLen(primes)+1] = candidates[i];
}
} // End while
// Return our list of primes
return primes;
}
</cfscript>
[/code]

Note: for whatever reason, this version only found primes up to but not including the number you passed in.
Here's a basic run-down of what it does:

Loop and create an array with every number from 1 to num.

In an outer loop, loop over each index in that array.

If this index has already been marked as non-prime (value set to 0) skip ahead.

Inner loop multiplies value of current array index by every natural number (1, 2, 3, 4, ...) until the product is greater than num.

For each product found, zero-out that index in the array since that number cannot be prime.

Finally, loop one last time over the array and copy the non-zero values (primes) into a second array and return it.

The Question

Jared asked me the following question in the comments of his blog:

However, I am curious... if 2 is the only odd prime number, why are you checking every number from 1 to n when you could just check the odd ones and cut out half the processing? I tried to adapt your function to that but it assumes that arrayLen(candidates) == num and breaks when you try to do anything else.
Any thought on updating it to that and seeing if there's any performance boost?

My Response

My response is below. As I typed it however, I began to wonder if there WAS additional gains to be had in my code.

Why don't I skip the even numbers? Well, technically I do. Well, I short circuit them anyway. In fact, I short circuit every number which has already been marked as a non-prime. That is what the following line of code does:
if(!candidates[p]) continue;
If the outer loop reaches a number in the array which has already been marked as a non-prime, it skips the inner loop right then and there. For instance, when the multiples of 3 are found-- 3, 6, 9, 15, etc will all be marked non-prime. Therefore there is no need to find the multiples of 6, 9, or 15 etc since their multiples are already a multiple of 3 and will have already been marked off. That means the list of array indexes that are allowed to be considered diminishes quickly as the outer loop grows. Just like the Wikipedia examples show the list shrinking. I'm leaving everything in my array, I'm just ignoring it.
I believe this may be why my function does well with larger numbers of primes because it is optimized to quickly whittle the list down on every pass of the inner loop so each increment of the outer loop skips more and more numbers.

The Answer

Ok, let's cut to the chase. Here's the revised version of my function, and it's faster. MUCH faster.

[code]
<cfscript>
function getPrimes4(num)
{
var p = 1; // Counter for outer loop over array
var i = 1; // Counter inner loop finding multiples of each array index
var _p = 3 ; // Stores the value being stored at index "p" in the candidate array
var m = 0; // holds multiples
var candidates = []; // main array that holds values 1 through num
var primes = []; // final array that primes are copied to to return
var pLimit = sqr(num); // Only
// Populate the candidate array with every odd number from 1 to num
// Throw 10 indexes in the array at a time to cut the number of
// times "i &lt; num" is evaluated by 90%
while (i &lt;= num)
{
arrayAppend(candidates,i);
arrayAppend(candidates,i+2);
arrayAppend(candidates,i+4);
arrayAppend(candidates,i+6);
arrayAppend(candidates,i+8);
arrayAppend(candidates,i+10);
arrayAppend(candidates,i+12);
arrayAppend(candidates,i+14);
arrayAppend(candidates,i+16);
arrayAppend(candidates,i+18);
i = i+20;
}
// Short circuit this scenario
if(num &lt; 2)
{
return primes;
}
// 2 is the only even prime number.
// Go ahead and stick it in the primes array if num is greater than 1
// since I won't be finding multiples of 2 later.
if(num>1) primes[1] = 2;
// Loop as long as the value of index p has not surpassed the square root of num
while(_p&lt;=pLimit)
{
// The first time through, p = 2 will will be a _p value of 3
// in other words, start by finding multiples of 3
p++;
// If the number at this index has already been marked as "0" (not prime) skip it.
if(!candidates[p])
{
continue;
}
// _p is the next prime number.
// It is also the number we are about to start finding multiples of.
_p = candidates[p];
// Since we know _p is prime, add it ot the primes array
// now while we're thinking about it
arrayAppend(primes,_p);
// Start calculating multiples with a product we haven't multipied yet. Namely _P^2
i = _p;
// Start loop over every odd number from -P
while (i &lt;= num)
{
// This product is NOT prime
m = _p*i;
// Assuming the product falls in our array
if(m&lt;=num)
{
// Mark it as not prime
// Since my array contains only prime numbers, a given value of m
// will be found at the index which is half of m+1.
// i.e. The number 15 will be stored in the 8th index of the array.
// This trick is important since I don't want to perofrm
// the expensive operation of searching through the array.
candidates[(m+1)/2] = 0;
}
else
{
// IF our product is outside the range we care about,
// break out of the loop.
break;
}
// increment by two to only hit odd numbers
i=i+2;
}
}
// Note, the primes array is already partially populated at this point.
// Begin looping at the current value of p to pick up where we left off.
// For every remaining index in our candidate array...
while (++p &lt;= arrayLen(candidates))
{
// ... that is not been marked prime and is less than our limit ...
if(candidates[p] > 1 && candidates[p] &lt;= num)
{
// Place it in the final array of primes.
arrayAppend(primes,candidates[p]);
}
}
// Return our list of primes
return primes;
}
</cfscript>
[/code]

The main thing I did was put in some debugging writeOutputs throughout the code and test a simple scenario like primes under 20. That allowed me to see parts of the code that were looping unnecessarily and could be short-circuited sooner. For the record I tested on my home PC with 32-bit CF8, a single-core 2.19 GHz CPU, and 512 MBs of heap.
Here's what I changed and why:

pLimit is the square root of the max number. There is no need for the outer loop to continue past this number since you will have reached all the divisors of num by that point (and therefore marked all non-primes). For instance, if num is 100, pLimit is 10. All non-prime numbers under 100 are divisible by a number less than or equal to 10.

I still can't believe this next one. This ONLY made a difference once I started getting close to 1 million, but the time spent populating my initial array was spending a crap ton of time evaluating the "while (i < num)" line of code since i and num have to be cast to a Java double data type behind the scenes on every pass. Appending 10 items to the array on every pass actually cut the time it took to build the array in half for very large numbers. The only downside is a few extra records in the array that will be ignored.

Don't put even numbers in the array! In order to avoid the costly exercise of searching the array, my function requires the ability to know what index to look in for any any given number in the array. Only storing odd numbers cuts the size of the array (and the memory requirements) in half. Want to know what index is storing the value m? Easy, it's in (m+1)/2!

Why wait until the end of my function to build the second array. Every _p value that I reach that is non-zero is a guaranteed prime, so I might as well add it to the primes array in the main outer loop and save myself the time later. Since I short-circuit out of the outer loop once I hit the square root of num, the final loop is still necessary, but notice I start at the current value of p so half of my work is already done. Looping is cheap, but looping 5 million times adds up.

Start the inner loop at the current value of the outer loop, not 1. The reason is this, if I'm looking at the number 5, there's no need to multiply it by 2, 3, and 4 so I can mark 10, 15, and 20 as non-primes. I already hit those when I was looking at 2, 3, and 4. In other words, 2 x 5 is the same as 5 x 2, so don't bother with it. Start by multiplying two numbers you haven't tried before, like 5 x 5. This removed a lot of redundant multiplication.

Only multiply by odd numbers. Anything times an even number is going to be an even number, and I've already ruled all of them out. I increment i by 2 every time and eliminated half of my inner loops.

Obligatory Statistics

So by eliminating needless looping and short circuiting when possible, I was able to get gains of about 75% over my original version of the function. The new code is also up to 85% faster than Jared's version, but it ONLY really shines when you start getting up to the primes under 15-20 million and up. Unfortunately, due to memory limitations, I wasn't able to get my function to run for anything over about 40 million. Apparently, building an Array with over 20 million indexes takes a lot of memory.
I leave you with this graph. Jared's function uses a different algorithm, so it's not much of a comparison, but I threw it in there just for reference. I pushed my latest version to all the primes in 40 million before I started running out of heap space, but I dropped off the other two functions from the test after 20 Mil because the times of the test were getting exponentially larger.
Click to enlarge