I've recently got interested in the problem of computing primes less than a certain ceiling and I've hit a roadblock, but it feels like a matter of not being well-read enough to work around it rather than being somehow fundamental.

I started with a brute-force Sieve of Eratosthenes and quickly realised that it could be optimized, since I only needed to store 1 bit per value in the sieved range. This worked up to roughly 32 billion, and then my machine ran out of memory. I tried again with a more sophisticated technique that chopped the range into smaller pages and separately kept track of which primes had been discovered so far and how far along the page their "stepping" was, but that fared worse in both space and speed, since it used 96 bits per prime it encountered, and the rarity of primes didn't compensate enough. I know it's also possible to do this with very little memory (~64 bits per prime) by simply testing each candidate value in turn (with some reduction by taking advantage of patterns like 6k+/-1) and accumulating a list of primes that way, but that obviously massively increases the time required. I'm also not even sure that this works much better than the first method of instantiating the entire range, because of the memory required to store the accumulated list of primes.
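For reference, the 1-bit-per-value approach described above can be sketched like this: a bit is stored only for each odd number, so a range of N costs roughly N/16 bytes. This is my own minimal sketch, not the questioner's actual code:

```python
import math

def sieve_bits(limit):
    """Sieve of Eratosthenes using 1 bit per odd number (~limit/16 bytes)."""
    if limit < 2:
        return []
    half = limit // 2 + 1              # bit i represents the odd number 2*i + 1
    bits = bytearray((half + 7) // 8)  # a set bit means "composite"
    for i in range(3, math.isqrt(limit) + 1, 2):
        if not bits[(i // 2) >> 3] >> ((i // 2) & 7) & 1:
            for j in range(i * i, limit + 1, 2 * i):   # odd multiples only
                bits[(j // 2) >> 3] |= 1 << ((j // 2) & 7)
    return [2] + [n for n in range(3, limit + 1, 2)
                  if not bits[(n // 2) >> 3] >> ((n // 2) & 7) & 1]
```

At 32 billion this layout needs about 2 GB, which matches the memory wall described in the question.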

Is there any more efficient technique that would let me calculate prime numbers above that 32 billion mark in a reasonable amount of time? (The ideal target would be being able to find the largest prime under 10^12, but I suspect that's far too ambitious to do in 4GB of memory.) Possibly a hybrid of these that I'm not seeing? I'm interested in optimizations that aren't strictly mathematical as well, if it's possible to e.g. unload large chunks of memory to disk so that while the total memory use is high, not all of it is used at once.

You can get some ideas from here: http://primesieve.org/#implementation. The source code is also available on GitHub. Depending on why you want/need to do this, it may also be easier to just run a script to wget a list of primes.

You could use wheel factorization first to remove all the "common" numbers (numbers divisible by 2, 3, etc.). Doing the first two gets you a 2/3 reduction in space for a bit array; you can do more if you want, with diminishing returns. But if you need a really large sieve then you'll probably need to look into storing it in chunks on the hard drive.
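As a concrete (and deliberately unoptimized) illustration of that 2/3 reduction, here is a small sieve that allocates one flag only for numbers coprime to 6; the index mapping between flags and values is my own choice, not a standard one:

```python
def sieve_wheel6(limit):
    """Sieve storing one flag per number coprime to 6 (1/3 of the range)."""
    def val(i):                      # i-th candidate: 5, 7, 11, 13, 17, 19, ...
        return 3 * i + 5 - (i % 2)
    def idx(n):                      # inverse of val() for n coprime to 6, n >= 5
        return (n - 5) // 3 if n % 6 == 5 else (n - 7) // 3 + 1
    # count of candidates <= limit: two per block of 6, minus the candidate 1
    size = max(0, limit // 6 * 2 + (limit % 6 >= 1) + (limit % 6 == 5) - 1)
    comp = bytearray(size)           # 1 = composite
    i = 0
    while val(i) * val(i) <= limit:
        if not comp[i]:
            p, j = val(i), i
            while p * val(j) <= limit:       # cross off p * (wheel candidate)
                comp[idx(p * val(j))] = 1
                j += 1
        i += 1
    return [p for p in (2, 3) if p <= limit] + \
           [val(i) for i in range(size) if not comp[i]]
```

The same idea extends to mod-30 or larger wheels; the bookkeeping grows while the savings shrink, which is the diminishing returns mentioned above.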

1. Compute primes less than n. The fastest algorithm is a well-optimized segmented Sieve of Eratosthenes, and the fastest implementation is primesieve, as lorb pointed out. While it omits many of the optimizations used in the production code, the author also has a nice example of how the segmented sieve works.
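In case it helps, the core of a segmented sieve fits in a few lines. This is a bare-bones sketch of the idea (byte flags rather than bits, my own naming), not primesieve's actual implementation: sieve the base primes up to sqrt(limit) once, then process the range one fixed-size segment at a time so memory stays small.

```python
import math

def segmented_count(limit, segment_size=1 << 16):
    """Count primes <= limit; memory stays ~ sqrt(limit) + segment_size."""
    if limit < 2:
        return 0
    root = math.isqrt(limit)
    flags = bytearray([1]) * (root + 1)     # simple sieve for the base primes
    flags[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(root) + 1):
        if flags[i]:
            flags[i * i::i] = bytearray(len(range(i * i, root + 1, i)))
    base = [i for i in range(2, root + 1) if flags[i]]
    count = 0
    for low in range(2, limit + 1, segment_size):
        high = min(low + segment_size - 1, limit)
        seg = bytearray([1]) * (high - low + 1)
        for p in base:
            if p * p > high:
                break
            start = max(p * p, (low + p - 1) // p * p)   # first multiple >= low
            seg[start - low::p] = bytearray(len(range(start, high + 1, p)))
        count += sum(seg)
    return count
```

Each segment only needs the starting offset of every base prime, which is exactly the "stepping" bookkeeping the question describes, just without storing it per discovered prime in the output range.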

A version of his demonstration segmented sieve using 1 bit per integer takes about 20 minutes to reach 10^12, with very little memory use. The fastest programs I know of at that size are yafu and primesieve which take under 5 minutes. Bernstein's primegen and Perl/ntheory take a bit under 10 minutes. Memory requirements for each are quite small. Perl/ntheory can generate and print primes up to 10^12 in 24 minutes on my machine. Note how even with specialized print routines, the output dominates the time required. This is especially true for primesieve which has faster sieving and much slower printing.

While people who read only the first paragraph of the Wikipedia page like to point out the Sieve of Atkin, in practice it is slower than a similarly optimized Sieve of Eratosthenes. Most of the reasons why are explained in detail on that page. In the example above, even primegen (a heavily optimized SoA implementation) is half the speed of the top two SoE implementations, and at 10^13 Perl/ntheory (using a segmented SoE) remains 2x slower, but primegen now takes 4x longer. That is, it's slowing down relative to the SoEs.

2. Computing primes in a small window [a,b]. Lots of programs will do this, and significantly faster than computing them all to b. primesieve takes only a couple milliseconds to get primes +/- 10000 around 10^12. Again memory use is negligible at this size. Perl/ntheory also takes only a few milliseconds to sieve and print them. The idea is very similar to a segmented sieve.
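The window version is essentially one segment of the segmented sieve: collect the primes up to sqrt(b), then sieve only [a, b]. A rough sketch (my own code, not primesieve's):

```python
import math

def primes_in_range(a, b):
    """Primes in [a, b], sieving only the window with primes <= sqrt(b)."""
    a = max(a, 2)
    if b < a:
        return []
    root = math.isqrt(b)
    flags = bytearray([1]) * (root + 1)     # simple sieve for the base primes
    flags[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(root) + 1):
        if flags[i]:
            flags[i * i::i] = bytearray(len(range(i * i, root + 1, i)))
    seg = bytearray([1]) * (b - a + 1)      # one flag per number in the window
    for p in (i for i in range(2, root + 1) if flags[i]):
        start = max(p * p, (a + p - 1) // p * p)         # first multiple >= a
        seg[start - a::p] = bytearray(len(range(start, b + 1, p)))
    return [a + i for i, alive in enumerate(seg) if alive]
```

Memory is sqrt(b) for the base sieve plus b-a+1 for the window, which is why +/- 10000 around 10^12 is negligible.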

At some threshold for a, you'll not want to do a full sieve any more, especially if the window b-a is small. This becomes obvious for examples like 10^300 to 10^300+10000. You certainly don't want to fully sieve that interval, but probably do a partial sieve followed by a primality test. Perl/ntheory takes 0.2 seconds to give the 16 primes in that range. Pari/GP takes 0.5 seconds with ispseudoprime or 28 seconds with isprime.
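The "partial sieve followed by a primality test" idea can be sketched in a few lines of Python. Note the hedges: miller_rabin below is a plain strong-pseudoprime test (deterministic only below roughly 3e24 with these bases, and not a primality *proof* like Pari's isprime), and the sieve_to cutoff of 10000 is an arbitrary choice for illustration, not a tuned value:

```python
def miller_rabin(n, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """Strong-pseudoprime test to the given bases."""
    if n < 2:
        return False
    for p in bases:
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True

def primes_in_window(a, b, sieve_to=10000):
    """Partially sieve [a, b] by small primes, then test the survivors."""
    small = [p for p in range(2, sieve_to) if miller_rabin(p)]
    seg = bytearray([1]) * (b - a + 1)
    for p in small:
        start = max(p * p, (a + p - 1) // p * p)
        seg[start - a::p] = bytearray(len(range(start, b + 1, p)))
    return [a + i for i, alive in enumerate(seg)
            if alive and miller_rabin(a + i)]
```

The partial sieve knocks out the bulk of the window cheaply, so the expensive per-number test only runs on the few survivors; the same structure works at 10^300, where the test dominates the cost.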

3. prevprime / nextprime. You certainly don't want to sieve all numbers to n just to find the previous prime of n. At some size of n you will want to use partial sieving, but that's somewhere in the 200-bit range. Until then, you use a simple wheel (e.g. mod-6 or mod-30) to skip obvious small multiples and use a fast primality test. That is the method used by Pari/GP and Perl/ntheory, among others.
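A sketch of nextprime along those lines, using a mod-30 wheel to skip multiples of 2, 3 and 5. This is not Pari's or ntheory's actual code; in particular, _is_prime here is plain trial division standing in for the fast primality test (BPSW or Miller-Rabin) that a real implementation would use:

```python
def _is_prime(n):
    """Stand-in for a fast test (BPSW / Miller-Rabin): 6k+-1 trial division."""
    if n < 2:
        return False
    for p in (2, 3):
        if n % p == 0:
            return n == p
    f = 5
    while f * f <= n:
        if n % f == 0 or n % (f + 2) == 0:
            return False
        f += 6
    return True

def next_prime(n):
    """Smallest prime > n, stepping only over residues coprime to 30."""
    if n < 7:
        return next(p for p in (2, 3, 5, 7) if p > n)
    residues = (1, 7, 11, 13, 17, 19, 23, 29)
    gaps = (6, 4, 2, 4, 2, 4, 6, 2)   # residues[i] + gaps[i] = next residue
    base, i = n // 30 * 30, 0
    while base + residues[i] <= n:    # first wheel candidate strictly above n
        i += 1
        if i == 8:
            base, i = base + 30, 0
    c = base + residues[i]
    while not _is_prime(c):
        c += gaps[i]
        i = (i + 1) % 8
    return c
```

Only 8 of every 30 numbers get tested, and swapping _is_prime for a proper fast test makes this usable far beyond 64 bits.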

You will never want to use the AKS test. It's horribly slow. Again, read the whole Wikipedia article and you find "While the algorithm is of immense theoretical importance, it is not used in practice. For 64-bit inputs, the Baillie–PSW primality test is deterministic and runs many orders of magnitude faster." Even trial division is faster for 64-bit inputs, for goodness sake. At some point around 10^25 AKS does beat trial division, but it will effectively never beat APR-CL or ECPP. Caveat: this reflects the current state of the art -- new research could change things, though AKS proper probably won't be it. Lenstra/Pomerance (2011): "A fundamentally new idea would be required to obtain a deterministic primality testing algorithm with run time exponent smaller than 6."

Discussing optimal primality tests is another few paragraphs, and there are lots of interesting implementation details. In short for 64-bit you want some simple pretests (e.g. divisibility by a few tiny primes), then either deterministic Miller-Rabin (at most 7 tests required) or BPSW.
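For 64-bit inputs, the 7-test deterministic Miller-Rabin looks like this. The base set is the well-known 7-base set attributed to Jim Sinclair; I'm quoting it from memory, so verify it against a published source before relying on it:

```python
def is_prime_u64(n):
    """Deterministic Miller-Rabin for n < 2**64 (7 strong-pseudoprime tests)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):   # cheap pretests
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:
        d, s = d // 2, s + 1
    for a in (2, 325, 9375, 28178, 450775, 9780504, 1795265022):
        a %= n                 # a base divisible by n counts as a pass
        if a == 0:
            continue
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True
```

The `a %= n` reduction matters: some of these bases have prime factors larger than 37, and without it the test would wrongly reject those primes.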

Optimized code for this takes about 1 microsecond at 10^12. At 10^18 or so, average of about 2 microseconds.

To test primality using the sieve you only need to test candidates through CEILING(SQRT(N)). For 32 billion, that works out to a maximum value of P of 178,886. Reason: any composite C can be written as C = P*Q with P <= Q; if both P and Q were greater than SQRT(C), their product would exceed C, which isn't possible. Therefore any composite number C must have at least one factor <= SQRT(C), and we don't need to find all factors to prove a number is composite.

Only sieve by primes; don't include composites. Right off the bat, removing multiples of 2 and 3, that means the candidate list for the preceding contains 59,629 numbers at most, which won't run you out of memory. No need to test composites when we've tested for the factors of the composites.
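Concretely, that strategy might look like the sketch below (my own code): sieve the small base primes once, then trial-divide any candidate by primes only, stopping at sqrt(n). The prime-only list is smaller still than the 59,629 candidate count above.

```python
import math

def base_primes(limit):
    """Simple sieve for the base primes up to limit (e.g. 178,886)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for i in range(2, math.isqrt(limit) + 1):
        if flags[i]:
            flags[i * i::i] = bytearray(len(range(i * i, limit + 1, i)))
    return [i for i, f in enumerate(flags) if f]

def is_prime_via(n, base):
    """Trial-divide n only by primes; base must reach sqrt(n)."""
    for p in base:
        if p * p > n:
            break
        if n % p == 0:
            return False
    return n > 1
```

With base_primes(178886) precomputed, any individual number up to 32 billion can be tested without holding the full sieve in memory.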

It's a pig, but you can actually test an individual number by trying 2, then 3, then 5, 7, 11, 13, 17, 19... That is, starting with 5, alternately add 2 and 4, a simple wheel as mentioned in a post above. The increment needs two statements:

test = test + inc;
inc = 6 - inc;

Start test at 5 and inc at 2, and repeat until test > CEILING(SQRT(N)). Not the most efficient, but if you're just testing occasionally to see if some number is prime (or searching for a prime near some number), this is fast enough that you won't twiddle your thumbs for any number of 12 digits or less. Beyond that you need a better algorithm, but it's cheap and easy if you don't need more. (Makes a terrible sieve, though.)
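Put together, the loop described above looks like this in Python (for concreteness; the original poster's language isn't specified):

```python
import math

def is_prime_6k(n):
    """Trial division stepping 5, 7, 11, 13, ... via the +2/+4 alternation."""
    if n < 2:
        return False
    if n % 2 == 0 or n % 3 == 0:
        return n in (2, 3)
    test, inc = 5, 2
    limit = math.isqrt(n)        # isqrt makes the CEILING(SQRT(N)) fuss moot
    while test <= limit:
        if n % test == 0:
            return False
        test = test + inc
        inc = 6 - inc            # alternate +2, +4
    return True
```

At 12 digits this is a few hundred thousand divisions per call, which is indeed tolerable for occasional use but hopeless as a sieve.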

Whether or not you meant to say this (I may have missed/misread it being said), if you want to save time sieving composites: look at the first (singular) multiple of the candidate sieve number, i.e. the candidate itself (assuming you start with all odd numbers, 3 to root(n), with 2 and all further even numbers already a given). If it's already been presieved, you can forget sieving by it and skip on to the next odd number to see if that one is skippable too.

I don't know, without testing, whether the 2, 3, 5, (+2, +4)* version breaks that, but I think it's OK. Usefully, no prime lies in the skipped gaps: each skipped number is 5 + 6n + 4 = 6n + 9 = 3(2n + 3) for n ≥ 0, i.e. a confirmed odd (2n + 1 > 1), non-trivial multiple of 3.

(It's effectively a sieve of its own, and the most efficient of its class (33% removed) after the already-integrated 2-sieve (50%), and it can save you from pre-sieve-checking the 3-sieve's targets for compositeness, which may be computationally easier than bit-wise/list-wise checking and skipping. You probably can't do 30n+5 skip-exclusions (seemingly the next layer in the pattern) in a way that scales better than ruling out three times as many five-numbers by check-and-skip. Another layer of bounce-checking before that adds complexity and eats up cycles and/or storage.)

If you're interested in storing primes, note that all primes > 30 are of the form 30n + {1,7,11,13,17,19,23,29}. There are 8 numbers in that set, so you can use a single byte to indicate all the primes in a block of 30 numbers. It's not too hard to write sieve algorithms so that they store the sieve in that form, although it's obviously slightly slower to read & write that data structure than a simple array.
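A minimal sketch of that byte encoding (my own function names): one byte per block of 30 numbers, with bit i set when 30*block + residue[i] is prime. Note that 2, 3 and 5 themselves don't fit the form and have to be special-cased elsewhere.

```python
RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)
BIT = {r: 1 << i for i, r in enumerate(RESIDUES)}

def pack_primes(primes, limit):
    """One byte per block of 30: bit i set if 30*block + RESIDUES[i] is prime."""
    blocks = bytearray(limit // 30 + 1)
    for p in primes:
        r = p % 30
        if r in BIT:                  # 2, 3, 5 are not representable
            blocks[p // 30] |= BIT[r]
    return blocks

def unpack_primes(blocks):
    """Recover the stored primes (those of the form 30n + r) in order."""
    out = []
    for b, byte in enumerate(blocks):
        for i, r in enumerate(RESIDUES):
            if byte >> i & 1:
                out.append(30 * b + r)
    return out
```

That's 8 candidates per 30 numbers, i.e. one byte per 30 integers, versus one bit per integer for a plain bit array: a further ~3.75x space saving.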

FWIW, there's some Python code in this old post that generates numbers of that form. However, it's not that useful for trial division work, since it only saves you from testing 2 numbers out of 30 compared to the 10 you'd be testing using the much simpler to implement 6n + {1, 5} sequence. In my time tests, trial division using the 6n + {1, 5} sequence computed in-line ends up being a little faster than using the 30n + {1,7,11,13,17,19,23,29} generator (Python function calls are relatively slow).
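Since the linked post isn't reproduced here, a generator of that 30n + {1,7,11,13,17,19,23,29} form might look something like the sketch below (my own reconstruction, not the code from that post). As noted above, the per-yield function-call overhead in Python can outweigh the divisions it saves over the in-line 6n + {1, 5} loop:

```python
def wheel30(limit):
    """Yield 2, 3, 5 and then every n <= limit coprime to 30, in order."""
    for p in (2, 3, 5):
        if p <= limit:
            yield p
    gaps = (4, 2, 4, 2, 4, 6, 2, 6)      # 7->11->13->17->19->23->29->31->37
    n, i = 7, 0
    while n <= limit:
        yield n
        n += gaps[i]
        i = (i + 1) % 8
```

Note it yields trial-division *candidates*, not primes: composites coprime to 30 (49, 77, ...) still come through, which is fine for trial division since their prime factors are tried first.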