Essay: Programming With Prime Numbers

September 25, 2012

We introduce today a new feature of Programming Praxis: essays. An essay isn’t an exercise; it doesn’t present a task for you to perform, but instead provides extended discussion and complete source code. Though essays may be based on exercises, the idea is that an essay can delve more deeply into a topic than is possible in the limited space and time of an exercise.

Our first essay is about a topic dear to my heart: programming with prime numbers. The essay presents two versions of the Sieve of Eratosthenes, uses trial division to identify primes and factor composites, and provides the Miller-Rabin pseudoprime checker and Pollard rho factoring algorithm. Source code is given in five languages: C, Haskell, Java, Python, and Scheme.
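To give a flavor of the essay's subject matter, here is a brief sketch of trial division, the simplest of the techniques the essay covers. This is a generic illustration, not the essay's actual source code:

```python
def is_prime(n):
    """Primality by trial division: test odd divisors up to the square root."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True

def factors(n):
    """Factor n into primes by repeated trial division."""
    fs, d = [], 2
    while d * d <= n:
        while n % d == 0:
            fs.append(d)
            n //= d
        d += 1
    if n > 1:        # whatever remains is prime
        fs.append(n)
    return fs
```

Trial division is fine for small numbers; the Miller-Rabin and Pollard rho algorithms discussed in the essay take over when the numbers get large.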

Please read the essay, and feel free to comment on it below; comments on the essay page itself are closed. Let me know if you see any errors. And feel free to link to the essay anywhere on the internet that you think is appropriate.


Loving this essay format. A few more of these and you have the makings of a must-have book. So far I have spotted one error: On page 5, in the upper-right paragraph, there is an am+1 that should be a^m+1.

Question: the prime number theorem offers a very tight range for the value of the nth prime. It’s possible to check each of the numbers in that range to see which are prime in a reasonably efficient manner. If, however, the range contains multiple primes, is there any way to figure out which one is the nth?

Paul Hofstra writes with notice of a bug in the Python version of the primes function: the first while loop should read while p*p <= n. Otherwise, if n is the square of a prime, it will be incorrectly included in the list of primes returned by the function; for instance, primes(9) returns the list [2, 3, 5, 7, 9], which is incorrect. Thanks, Paul.
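For readers who want to see the fix in context, here is a sketch of a sieve with the corrected loop condition. This is a reconstruction for illustration, not the essay's exact code:

```python
def primes(n):
    """Return a list of the primes not exceeding n (Sieve of Eratosthenes)."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    p = 2
    while p * p <= n:        # the corrected condition; p * p < n misses n = p * p
        if sieve[p]:
            for multiple in range(p * p, n + 1, p):
                sieve[multiple] = False
        p += 1
    return [i for i in range(n + 1) if sieve[i]]
```

With the strict inequality `p * p < n`, the loop stops before p = 3 when n = 9, so 9 is never crossed off; with `<=`, primes(9) correctly returns [2, 3, 5, 7].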

I think there’s either a bug in the Python code for rho_factors or in my understanding of it!
In the rho_factor subroutine, the check "limit == 0" is made. Limit starts at a large number by default and is never decremented, so that test will always return False. Should it be "limit == c" instead?

In that case, there should probably be some way to reset the limit after a 1 < d < n has been found. Suppose for the sake of simplicity that we've set limit = 10. We try c = 1, and it takes 3 iterations to find a loop (no factor). So we then try c = 2 with a limit of 7, and it finds a factor d after 5 iterations. But d is composite, so we start the algorithm again to factor d. Surely with a new number to factor we don't want to set limit = 2? There's probably also a case for resetting the limit after we change c, maybe to some sub-limit, to avoid 'wasting' all our iterations on a function f that has an unusually long tail.

Sorry for harping on about this, I just want to understand what's going on.
P.S. I realise now that "return rho_factor(n, c + 1, limit - 1)" is virtually no different than checking "limit == c". Whoops!

The use of limit is completely arbitrary, as determined by the needs of the user, which is the reason it was made a parameter to the function. If you initialize limit to -1, the rho algorithm will continue until all factors have been found, which may be what you want to do when the n being factored is fairly small (say, less than 20 or 25 digits). But if you are using the rho algorithm to quickly extract small factors from a very large number, you probably want to initialize limit to a fairly small number so that once you have the small factors you can switch to a more powerful factoring method such as the elliptic curve algorithm. For example, if you want to factor a 120-digit n, you might start by trial division to 1000 or 10,000 to find all 3- or 4-digit factors, followed by a million steps of Pollard rho to find all factors up to 10 or 12 digits, followed by the elliptic curve algorithm until you are convinced the remaining co-factor is a semi-prime (because you have, with high probability, found all factors less than the cube root of the co-factor), followed by the quadratic sieve. But all that is really beyond the scope of what this essay does.
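The role of limit as an iteration budget can be sketched like this. This is a generic Pollard rho with Floyd cycle detection and a hypothetical limit convention (-1 meaning unbounded), not the essay's exact code:

```python
import math

def rho_factor(n, c=1, limit=100000):
    """Find one nontrivial factor of composite n by Pollard's rho,
    iterating f(x) = x*x + c (mod n).  Returns None if the iteration
    budget `limit` is exhausted; limit = -1 means run without bound.
    (A sketch with assumed limit semantics, not the essay's code.)"""
    if n % 2 == 0:
        return 2
    x, y, d, count = 2, 2, 1, 0
    while d == 1:
        if count == limit:              # with limit = -1 this never triggers
            return None
        x = (x * x + c) % n             # tortoise: one step
        y = (y * y + c) % n             # hare: two steps
        y = (y * y + c) % n
        d = math.gcd(abs(x - y), n)
        count += 1
    if d == n:                          # cycle found without a factor: new c,
        return rho_factor(n, c + 1,     # charging the steps already spent
                          limit - count if limit > 0 else limit)
    return d
```

For example, rho_factor(8051) returns one of the factors 83 or 97. Decrementing the budget by the steps actually consumed, rather than by 1 per value of c, is one way to address the sub-limit question raised above.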

Ok, thanks for that. I guess it’s not so bad that, having found a composite factor d, we run the algorithm again with a smaller limit; with a smaller number to factor, it should return much more quickly. Further, if we remove all the small factors using trial division as you suggest, then the likelihood of finding a composite factor is greatly diminished.

I understand the difference between limit and c. I merely observed that my two proposed ‘solutions’ were essentially the same: if you increment c at the same time you decrement limit, then checking whether limit == c or limit == 0 amounts to the same thing (almost, because c starts at 1 instead of 0).