Implement the Sieve of Eratosthenes algorithm, with the only allowed optimization that the outer loop can stop at the square root of the limit, and the inner loop may start at the square of the prime just found.

That means especially that you shouldn't optimize by using pre-computed wheels, i.e. don't assume you need only to cross out odd numbers (wheel based on 2), numbers equal to 1 or 5 modulo 6 (wheel based on 2 and 3), or similar wheels based on low primes.

If there's an easy way to add such a wheel based optimization, implement it as an alternative version.

Note

It is important that the sieve algorithm be the actual algorithm used to find prime numbers for the task.

If this subroutine is called with the value of n in the accumulator, it will store an array of the primes less than n beginning at address 1000 hex and return the number of primes it has found in the accumulator.

GETN MACROMOVEQ #4,D0; Read a number from the keyboard into D1.L. TRAP #15ENDM

**---- Application-specific macros ----**

val MACRO; Used by bit sieve. Converts bit address to the number it represents.ADD.L \1,\1; double it because odd numbers are omittedADDQ #3,\1; add offset because initial primes (1, 2) are omittedENDM

It would have been better to require a result of the boolean mask rather than the actual list of primes.
The list of primes obtains readily from the mask by application of a simple function (here {⍵/⍳⍴⍵}).
Other related computations (such as the number of primes < n) obtain readily from the mask,
easier than producing the list of primes.

An initial array holds all numbers 2..max (which is entered on stdin);
then all products of integers are deleted from it;
the remaining are displayed in the unsorted appearance of a hash table.
Here, the script is entered directly on the commandline,
and input entered on stdin:

Here is an alternate version that uses an associative array to record composites with a prime dividing it. It can be considered a slow version, as it does not cross out composites until needed. This version assumes enough memory to hold all primes up to ULIMIT. It prints out noncomposites greater than 1.

100 PROGRAM "Sieve.bas"110 LET LIMIT=100120 NUMERIC T(1 TO LIMIT)130 FOR I=1 TO LIMIT140 LET T(I)=0150 NEXT160 FOR I=2 TO SQR(LIMIT)170 IF T(I)<>1 THEN180 FOR K=I*I TO LIMIT STEP I190 LET T(K)=1200 NEXT210 END IF220 NEXT230 FOR I=2 TO LIMIT ! Display the primes240 IF T(I)=0 THEN PRINT I;250 NEXT

If you only have 1k of RAM, this program will work—but you will only be able to sieve numbers up to 101. The program is therefore more useful if you have more memory available.

A note on FAST and SLOW: under normal circumstances the CPU spends about 3/4 of its time driving the display and only 1/4 doing everything else. Entering FAST mode blanks the screen (which we do not want to update anyway), resulting in substantially improved performance; we then return to SLOW mode when we have something to print out.

10 INPUT L 20 FAST 30 DIM N(L) 40 FOR I=2 TO SQR L 50 IF N(I) THEN GOTO 90 60 FOR J=I+I TO L STEP I 70 LET N(J)=1 80 NEXT J 90 NEXT I100 SLOW110 FOR I=2 TO L120 IF NOT N(I) THEN PRINT I;" ";130 NEXT I

This solution does not use an array. Instead, numbers themselves are used as variables. The numbers that are not prime are set (to the silly value "nonprime"). Finally all numbers up to the limit are tested for being initialised. The uninitialised (unset) ones must be the primes.

We first fill ones into an array and assume all numbers are prime.
Then, in a loop, fill zeroes into those places where i * j is less than or equal to n (number of primes requested), which means they have multiples!
To understand this better, look at the output of the following example.

To print this back, we look for ones in the array and only print those spots.

This implementation follows the standard library pattern of std::iota. The start and end iterators are provided for the container. The destination container is used for marking primes and then filled with the primes which are less than the container size. This method requires no memory allocation inside the function.

To show that C# code can be written in somewhat functional paradigms, the following in an implementation of the Richard Bird sieve from the Epilogue of [Melissa E. O'Neill's definitive article](http://www.cs.hmc.edu/~oneill/papers/Sieve-JFP.pdf) in Haskell:

The following code implements an improved version of the odds-only O'Neil algorithm, which provides the improvements of only adding base prime composite number streams to the queue when the sieved number reaches the square of the base prime (saving a huge amount of memory and considerable execution time, including not needing as wide a range of a type for the internal prime numbers) as well as minimizing stream processing using fusion:

The above code runs in about three quarters of the time as the above Priority Queue based version for a range of a million primes which will fall even further behind for increasing ranges due to the Dictionary providing O(1) access times as compared to the O(log n) access times for the Priority Queue; the only slight advantage of the PQ based version is at very small ranges where the constant factor overhead of computing the table hashes becomes greater than the "log n" factor for small "n".

Page Segmented Array Sieve

All of the above unbounded versions are really just an intellectual exercise as with very little extra lines of code above the fastest Dictionary based version, one can have an bit-packed page-segmented array based version as follows:

The above code is about 25 times faster than the Dictionary version at computing the first about 50 million primes (up to a range of one billion), with the actual enumeration of the result sequence now taking longer than the time it takes to cull the composite number representation bits from the arrays, meaning that it is over 50 times faster at actually sieving the primes. The code owes its speed as compared to a naive "one huge memory array" algorithm to using an array size that is the size of the CPU L1 or L2 caches and using bit-packing to fit more number representations into this limited capacity; in this way RAM memory access times are reduced by a factor of from about four to about 10 (depending on CPU and RAM speed) as compared to those naive implementations, and the minor computational cost of the bit manipulations is compensated by a large factor in total execution time.

The time to enumerate the result primes sequence can be reduced somewhat (about a second) by removing the automatic iterator "yield return" statements and converting them into a "rull-your-own" IEnumerable<PrimeT> implementation, but for page segmentation of odds-only, this iteration of the results will still take longer than the time to actually cull the composite numbers from the page arrays.

In order to make further gains in speed, custom methods must be used to avoid using iterator sequences. If this is done, then further gains can be made by extreme wheel factorization (up to about another about four times gain in speed) and multi-processing (with another gain in speed proportional to the actual independent CPU cores used).

Note that all of these gains in speed are not due to C# other than it compiles to reasonably efficient machine code, but rather to proper use of the Sieve of Eratosthenes algorithm.

All of the above unbounded code can be tested by the following "main" method (replace the name "PrimesXXX" with the name of the class to be tested):

This example is incorrect. Please fix the code and remove this message.

Details: The first version uses rem testing and so is a trial division algorithm, not a sieve of Eratosthenes.

Note: this is not a Sieve of Eratosthenes; it is just an inefficient version (because it doesn't trial just the found primes <= the square root of the range) trial division.

primes< is a functional interpretation of the Sieve of Eratosthenes. It merely removes the set of composite numbers from the set of odd numbers (wheel of 2) leaving behind only prime numbers. It uses a transducer internally with "into #{}".

This implemantation is about twice fast than previous one and use only half memory.
From the index of array calculates the value it represents as (2*i + 1), the step between two index that represents
the multiples of primes to mark as composite is also (2*i + 1).
The index of the square of the prime to start composite marking is 2*i*(i+1).

The reason that the above code is so slow is that it has has a high constant factor overhead due to using a (hash) set to remove the composites from the future composites stream, each prime composite stream removal requires a scan across all remaining composites (compared to using an array or vector where only the culled values are referenced, and due to the slowness of Clojure sequence operations as compared to iterator/sequence operations in other languages.

Version based on immutable Vector's

Here is an immutable boolean vector based non-lazy sequence version other than for the lazy sequence operations to output the result:

The above code is about as fast as any "one large sieving array" type of program in any computer language with this level of wheel factorization other than the lazy sequence operations are quite slow: it takes about ten times as long to enumerate the results as it does to do the actual sieving work of culling the composites from the sieving buffer array. The slowness of sequence operations is due to nested function calls, but primarily due to the way Clojure implements closures by "boxing" all arguments (and perhaps return values) as objects in the heap space, which then need to be "un-boxed" as primitives as necessary for integer operations. Some of the facilities provided by lazy sequences are not needed for this algorithm, such as the automatic memoization which means that each element of the sequence is calculated only once; it is not necessary for the sequence values to be retraced for this algorithm.

If further levels of wheel factorization were used, the time to enumerate the resulting primes would be an even higher overhead as compared to the actual composite number culling time, would get even higher if page segmentation were used to limit the buffer size to the size of the CPU L1 cache for many times better memory access times, most important in the culling operations, and yet higher again if multi-processing were used to share to page segment processing across CPU cores.

The following code overcomes many of those limitations by using an internal (OPSeq) "deftype" which implements the ISeq interface as well as the Counted interface to provide immediate count returns (based on a pre-computed total), as well as the IReduce interface which can greatly speed come computations based on the primes sequence (eased greatly using facilities provided by Clojure 1.7.0 and up):

Due to the better efficiency of the custom CIS type, the primes to the above range can be enumerated in about the same 40 milliseconds that it takes to cull and count the sieve buffer array.

Under Clojure 1.7.0, one can use '(time (reduce (fn [] (+ (long sum) (long v))) 0 (primes-tox 2000000)))' to find "142913828922" as the sum of the primes to two million as per Euler Problem 10 in about 40 milliseconds total with about half the time used for sieving the array and half for computing the sum.

To show how sensitive Clojure is to forms of expression of functions, the simple form '(time (reduce + (primes-tox 2000000)))' takes about twice as long even though it is using the same internal routine for most of the calculation due to the function not having the type coercion's.

Before one considers that this code is suitable for larger ranges, it is still lacks the improvements of page segmentation with pages about the size of the CPU L1/L2 caches (produces about a four times speed up), maximal wheel factorization (to make it another about four times faster), and the use of multi-processing (for a further gain of about 4 times for a multi-core desktop CPU such as an Intel i7), will make the sieving/counting code about 50 times faster than this, although there will only be a moderate improvement in the time to enumerate/process the resulting primes. Using these techniques, the number of primes to one billion can be counted in a small fraction of a second.

The above code is still slower than it should be due to the slowness of Clojure's sequence operations.

A Clojure version of the above tree folding sieve using a custom Co Inductive Sequence

The following code uses a custom "deftype" non-memoizing Co Inductive Stream/Sequence (CIS) implementing the ISeq interface to make the sequence operations more efficient and is about four times faster than the above code:

The above code is useful for ranges up to about fifteen million primes, which is about the first million primes; it is comparable in speed to all of the bounded versions except for the fastest bit packed version which can reasonably be used for ranges about 100 times as large.

Incremental Hash Map based unbounded "odds-only" version

The following code is a version of the O'Neill Haskell code but does not use wheel factorization other than for sieving odds only (although it could be easily added) and uses a Hash Map (constant amortized access time) rather than a Priority Queue (log n access time for combined remove-and-insert-anew operations, which are the majority used for this algorithm) with a lazy sequence for output of the resulting primes; the code has the added feature that it uses a secondary base primes sequence generator and only adds prime culling sequences to the composites map when they are necessary, thus saving time and limiting storage to only that required for the map entries for primes up to the square root of the currently sieved number:

The above code is slower than the best tree folding version due to the added constant factor overhead of computing the hash functions for every hash map operation even though it has computational complexity of (n log log n) rather than the worse (n log n log log n) for the previous incremental tree folding sieve. It is still about 100 times slower than the sieve based on the bit-packed mutable array due to these constant factor hashing overheads.

There is almost no benefit of converting the above code to use a CIS as most of the time is expended in the hash map functions.

Incremental Priority Queue based unbounded "odds-only" version

In order to implement the O'Neill Priority Queue incremental Sieve of Eratosthenes algorithm, one requires an efficient implementation of a Priority Queue, which is not part of standard Clojure. For this purpose, the most suitable Priority Queue is a binary tree heap based MinHeap algorithm. The following code implements a purely functional (using entirely immutable state) MinHeap Priority Queue providing the required functions of (emtpy-pq) initialization, (getMin-pq pq) to examinte the minimum key/value pair in the queue, (insert-pq pq k v) to add entries to the queue, and (replaceMinAs-pq pq k v) to replaace the minimum entry with a key/value pair as given (it is more efficient that if functions were provided to delete and then re-insert entries in the queue; there is therefore no "delete" or other queue functions supplied as the algorithm does not requrie them:

Note that the above code is written partially using continuation passing style so as to leave the "recur" calls in tail call position as required for efficient looping in Clojure; for practical sieving ranges, the algorithm could likely use just raw function recursion as recursion depth is unlikely to be used beyond a depth of about ten or so, but raw recursion is said to be less code efficient.

The actual incremental sieve using the Priority Queue is as follows, which code uses the same optimizations of postponing the addition of prime composite streams to the queue until the square root of the currently sieved number is reached and using a secondary base primes stream to generate the primes composite stream markers in the queue as was used for the Hash Map version:

The above code is faster than the Hash Map version up to about a sieving range of fifteen million or so, but gets progressively slower for larger ranges due to having (n log n log log n) computational complexity rather than the (n log log n) for the Hash Map version, which has a higher constant factor overhead that is overtaken by the extra "log n" factor.

It is slower that the fastest of the tree folding versions (which has the same computational complexity) due to the higher constant factor overhead of the Priority Queue operations (although perhaps a more efficient implementation of the MinHeap Priority Queue could be developed).

Again, these non-mutable array based sieves are about a hundred times slower than even the "one large memory buffer array" version as implemented in the bounded section; a page segmented version of the mutable bit-packed memory array would be several times faster.

All of these algorithms will respond to maximum wheel factorization, getting up to approximately four times faster if this is applied as compared to the the "odds-only" versions.

It is difficult if not impossible to apply efficient multi-processing to the above versions of the unbounded sieves as the next values of the primes sequence are dependent on previous changes of state for the Bird and Tree Folding versions; however, with the addition of a "update the whole Priority Queue (and reheapify)" or "update the Hash Map" to a given page start state functions, it is possible to do for these letter two algorithms; however, even though it is possible and there is some benefit for these latter two implementations, the benefit is less than using mutable arrays due to that the results must be enumerated into a data structure of some sort in order to be passed out of the page function whereas they can be directly enumerated from the array for the mutable array versions.

Bit packed page segmented array unbounded "odds-only" version

To show that Clojure does not need to be particularly slow, the following version runs about twice as fast as the non-segmented unbounded array based version above (extremely fast compared to the non-array based versions) and only a little slower than other equivalent versions running on virtual machines: C# or F# on DotNet or Java and Scala on the JVM:

The above code runs just as fast as other virtual machine languages when run on a 64-bit JVM; however, when run on a 32-bit JVM it runs almost five times slower. This is likely due to Clojure only using 64-bit integers for integer operations and these operations getting JIT compiled to use library functions to simulate those operations using combined 32-bit operations under a 32-bit JVM whereas direct CPU operations can be used on a 64-bit JVM

Clojure does one thing very slowly, just as here: it enumerates extremely slowly as compared to using a more imperative iteration interface; it helps to use a roll-your-own ISeq interface as here, where enumeration of the primes reduces the time from about four times as long as the composite culling operations for those primes to only about one and a half times as long, although one must also write their own sequence handling functions (can't use "take-while" or "count", for instance) in order to enjoy that benefit. That is why the "primes-paged-count-to" function is provided so it takes a negligible percentage of the time to count the primes over a range as compared to the time for the composite culling operations.

The practical range of the above sieve is about 16 million due to the fixed size of the page buffers; in order to extend the range, a larger page buffer could be used up to the size of the CPU L2 or L3 caches. If a 2^20 buffer were used (one Megabyte, as many modern dexktop CPU's easily have in their L3 cache), then the range would be increased up to about 10^14 at a cost of about a factor of two or three in slower memory accesses per composite culling operation loop. The base primes culling page size is already adequate for this range. One could make the culling page size automatically expand with growing range by about the square root of the current prime range with not too many changes to the code.

As for many implementations of unbounded sieves, the base primes less than the square root of the current range are generated by a secondary generated stream of primes; in this case it is done recursively, so another secondary stream generates the base primes for the base primes and so on down to where the innermost generator has only one page in the stream; this only takes one or two recursions for this type of range.

The base primes culling page size is reduced from the page size for the main primes so that there is less overhead for smaller primes ranges; otherwise excess base primes are generated for fairly small sieve ranges.

# For each multiple of i, set n => no it is not prime.# Optimization: start at i squared.math(EXPR square "${i} * ${i}")if(NOT square GREATER${limit})# Avoid fatal error.foreach(m RANGE ${square}${limit}${i})set(prime_${m} n)endforeach(m)endif()endif(prime_${i})endforeach(i)set(${var}${list}PARENT_SCOPE)endfunction(eratosthenes)

The indexation scheme used here interprets each index i as standing for the value 2i+1. Bit 0 is unused, a small price to pay for the simpler index calculations compared with the 2i+3 indexation scheme. The multiples of a given odd prime p are enumerated in increments of 2p, which corresponds to the index increment of p on the sieve array. The starting point p*p = (2i+1)(2i+1) = 4i(i+1)+1 corresponds to the index 2i(i+1).

While formally a wheel, odds are uniformly spaced and do not require any special processing except for value translation. Wheels proper aren't uniformly spaced and are thus trickier.

Although it has the characteristics of a true Sieve of Eratosthenes, the above code isn't very efficient due to the remove/modify operations on the Set. Due to these, the computational complexity isn't close to linear with increasing range and it is quite slow for larger sieve ranges compared to compiled languages, taking about four seconds to sieve to ten million.

The above code is somewhat faster at about ten seconds using the Dart VM to sieve to 100 million, although much faster at about 1.5 seconds run conventionally in Google Chrome using the JavaScript V8 engine, likely due to JavaScript using double floating point numbers for int's whereas the Dart VM uses arbitrary precision integers.

This version calculates the 50,847,534 primes up to one billion in about 20 seconds under the Dart Virtual Machine (VM). Under the Google Chrome V8 JavaScript engine it should take the same time as the JavaScript from which it was translated of about five seconds, but takes about 14 seconds due to the dart2js compiler adding extra run time array buffer range checks to the innermost culling loops, even though the "check" compiler option was not selected.

Also note the comment at the beginning of the "moveNext()" method about the redundant local variable needed to be added in order for the code to run under JavaScript using Dart 1.5.1 (and possible other versions), which shouldn't happen when it runs fine under the Dart VM without that extra local variable (based only on the private class field _lowi).

A much slower, perverse method, using only lists of tuples. Especially evil is the P = lists:filtermap operation which yields a list for every iteration of the X * M row.
Has the virtue of working for any -> N :)

BEGIN PRINT("Only 1 iteration") COUNT%=0 FOR I%=0 TO SIZE% DO IF FLAGS%[I%]=TRUE THEN !$NULL ELSE PRIME%=I%+I%+3 K%=I%+PRIME% WHILE NOT (K%>SIZE%) DO FLAGS%[K%]=TRUE K%=K%+PRIME% END WHILE PRINT(PRIME%;) COUNT%=COUNT%+1 END IF END FOR PRINT PRINT(COUNT%;" PRIMES")END PROGRAM

This is the idea behind Richard Bird's unbounded code presented in the Epilogue of M. O'Neill's article in Haskell. It is about twice as much code as the Haskell code because F# does not have a built-in lazy list so that the effect must be constructed using a Co-Inductive Stream (CIS) type since no memoization is required, along with the use of recursive functions in combination with sequences. The type inference needs some help with the new CIS type (including selecting the generic type for speed). Note the use of recursive functions to implement multiple non-sharing delayed generating base primes streams, which along with these being non-memoizing means that the entire primes stream is not held in memory as for the original Bird code:

The above code sieves all numbers of two and up including all even numbers as per the page specification; the following code makes the very minor changes for an odds-only sieve, with a speedup of over a factor of two:

The above code is still somewhat inefficient as it operates on a linear right extending structure that deepens linearly with increasing base primes (those up to the square root of the currently sieved number); the following code changes the structure into an infinite binary tree-like folding by combining each pair of prime composite streams before further processing as usual - this decreases the processing by approximately a factor of log n:

The above code is over four times faster than the "BirdOdds" version (at least 10x faster than the first, "primesBird", producing the millionth prime) and is moderately useful for a range of the first million primes or so.

Priority Queue Sieve

In order to investigate Priority Queue Sieves as espoused by O'Neill in the referenced article, one must find an equivalent implementation of a Min Heap Priority Queue as used by her. There is such an purely functional implementation in RosettaCode translated from the Haskell code she used, from which the essential parts are duplicated here (Note that the key value is given an integer type in order to avoid the inefficiency of F# in generic comparison):

Except as noted for any individual code, all of the following codes need the following prefix code in order to implement the non-memoizing Co-Inductive Streams (CIS's) and to set the type of particular constants used in the codes to the same time as the "Prime" type:

The F# equivalent to O'Neill's "odds-only" code is then implemented as follows, which needs the included changed prefix in order to change the primes type to a larger one to prevent overflow (as well the key type for the MinHeap needs to be changed from uint32 to uint64); it is functionally the same as the O'Neill code other than for minor changes to suit the use of CIS streams and the option output of the "peekMin" function:

However, that algorithm suffers in speed and memory use due to over-eager adding of prime composite streams to the queue such that the queue used is much larger than it needs to be and a much larger range of primes number must be used in order to avoid numeric overflow on the square of the prime added to the queue. The following code corrects that by using a secondary (actually a multiple of) base primes streams which are constrained to be based on a prime that is no larger than the square root of the currently sieved number - this permits the use of much smaller Prime types as per the default prefix:

Substituting the following minor changes to the code for the "primes" function will only deal with the odd prime candidates for a speed up of over a factor of two as well as a reduction of the buffer size by a factor of two:

The following code uses other functional forms for the inner culling loops of the "primes function" to reduce the use of inefficient sequences so as to reduce the execution time by another factor of almost three:

Now much of the remaining execution time is just the time to enumerate the primes as can be seen by turning "primes" into a primes counting function by substituting the following for the last line in the above code doing the enumeration; this makes the code run about a further five times faster:

Since the final enumeration of primes is the main remaining bottleneck, it is worth using a "roll-your-own" enumeration implemented as an object expression so as to save many inefficiencies in the use of the built-in seq computational expression by substituting the following code for the last line of the previous codes, which will decrease the execution time by a factor of over three (instead of almost five for the counting-only version, making it almost as fast):

The various optimization techniques shown here can be used "jointly and severally" on any of the basic versions for various trade-offs between code complexity and performance. Not shown here are other techniques of making the sieve faster, including extending wheel factorization to much larger wheels such as 2/3/5/7, pre-culling the arrays, page segmentation, and multi-processing.

the following odds-only implmentations are written in an almost functional style avoiding the use of mutability except for the contents of the data structures uses to hold the state of the and any mutability necessary to implement a "roll-your-own" IEnumberable iterator interface for speed.

Unbounded Dictionary (Mutable Hash Table) Based Sieve

The following code uses the DotNet Dictionary class instead of the above functional Priority Queue to implement the sieve; as average (amortized) hash table access is O(1) rather than O(log n) as for the priority queue, this implementation is slightly faster than the priority queue version for the first million primes and will always be faster for any range above some low range value:

The above code uses functional forms of code (with the imperative style commented out to show how it could be done imperatively) and also uses a recursive non-sharing secondary source of base primes just as for the Priority Queue version. As for the functional codes, the Primes type can easily be changed to "uint64" for wider range of sieving.

In spite of having true O(n log log n) Sieve of Eratosthenes computational complexity where n is the range of numbers to be sieved, the above code is still not particularly fast due to the time required to compute the hash values and manipulations of the hash table.

Unbounded Page Segmented Mutable Array Sieve

All of the above unbounded implementations including the above Dictionary based version are quite slow due to their large constant factor computational overheads, making them more of an intellectual exercise than something practical, especially when larger sieving ranges are required. The following code implements an unbounded page segmented version of the sieve in not that many more lines of code, yet runs about 25 times faster than the Dictionary version for larger ranges of sieving such as to one billion; it uses functional forms without mutability other than for the contents of the arrays and a reference cell used to implement the "roll-your-own" IEnumerable/IEnumerator interfaces for speed:

As with all of the efficient unbounded sieves, the above code uses a secondary enumerator of the base primes less than the square root of the currently culled range ("lbpse"), which is this case is a lazy (deffered evaluation) binding so as to avoid a race condition.

The above code is written to output the "uint64" type for very large ranges of primes since there is little computational cost to doing this for this algorithm. As written, the practical range for this sieve is about 16 billion, however, it can be extended to about 10^14 (a week or two of execution time) by setting the "PGSZBTS" constant to the size of the CPU L2 cache rather than the L1 cache (L2 is up to about two Megabytes for modern high end desktop CPU's) at a slight loss of efficiency (a factor of up to two or so) per composite number culling operation due to the slower memory access time.

Even with the custom IEnumerable/IEnumerator interfaces using an object expression (the F# built-in sequence operators are terribly inefficient), the time to enumerate the resulting primes takes longer than the time to actually cull the composite numbers from the sieving arrays. The time to do the actual culling is thus over 50 times faster than done using the Dictionary version. The slowness of enumeration, no matter what further tweaks are done to improve it (each value enumerated will always take function calls and a scan loop that will always take something in the order of 100 CPU clock cycles per value), means that further gains in speed using extreme wheel factorization and multi-processing have little point unless the actual work on the resulting primes is done through use of auxiliary functions not using iteration.

Factor already contains two implementations of the sieve of Eratosthenes in math.primes.erato and math.primes.erato.fast. It is suggested to use one of them for real use, as they use faster types, faster unsafe arithmetic, and/or wheels to speed up the sieve further. Shown here is a more straightforward implementation that adheres to the restrictions given by the task (namely, no wheels).

Factor is pleasantly multiparadigm. Usually, it's natural to write more functional or declarative code in Factor, but this is an instance where it is more natural to write imperative code. Lexical variables are useful here for expressing the necessary mutations in a clean way.

Fōrmulæ programs are not textual, visualization/edition of programs is done showing/manipulating structures but not text. Moreover, there can be multiple visual representations of the same program. Even though it is possible to have textual representation —i.e. XML, JSON— they are intended for transportation effects more than visualization and edition.

The option to show Fōrmulæ programs and their results is showing images. Unfortunately images cannot be uploaded in Rosetta Code.

The above version's output is rather specialized; the following version uses a closure function to enumerate over the culled composite number array, which is bit packed. By using this scheme for output, no extra memory is required above that required for the culling array:

This solution uses a BitSet for compactness and speed, but in Groovy, BitSet has full List semantics. It also uses both the "square root of the boundary" shortcut and the "square of the prime" shortcut.

This represents odds only in the array. Empirical orders of growth is ~ n1.2 in n primes produced, and improving for bigger n‍ ‍s. Memory consumption is low (array seems to be packed) and growing about linearly with n. Can further be significantly sped up by re-writing the forM_ loops with direct recursion, and using unsafeRead and unsafeWrite operations.

Its performance sharply depends on compiler optimizations. Compiled with -O2 flag in the presence of the explicit type signature, it is very fast in producing first few million primes. (//) is an array update operator.

Its time complexity is similar to that of optimal trial division because of limitations of Haskell linked lists, where (minus a b) takes time proportional to length(union a b) and not (length b), as achieved in imperative setting with direct-access memory. Uses ordered list representation of sets.

This is reasonably useful up to ranges of fifteen million or about the first million primes.

This is slow, with complexity increasing as a square law or worse so that it is only moderately useful for the first few thousand primes or so.

The number of active streams can be limited to what's strictly necessary by postponement until the square of a prime is seen, getting a massive complexity improvement to better than ~ n1.5 so it can get first million primes or so in a tolerable time:

Using _Y is meant to guarantee the separate supply of primes to be independently calculated, recursively, instead of the same one being reused, corecursively; thus the memory footprint is drastically reduced. This idea was introduced by M. ONeill as a double-staged production, with a separate primes feed.

The above code is also useful to a range of the first million primes or so. The code can be further optimized by fusing minus [3..] into one function, preventing a space leak with the newer GHC versions, getting the function gaps defined below.

Linear merging structure can further be replaced with an wiki.haskell.org/Prime_numbers#Tree_merging indefinitely deepening to the right tree-like structure, (a-(b+((c+d)+( ((e+f)+(g+h)) + ... )))).

This merges primes' multiples streams in a tree-like fashion, as a sequence of balanced trees of union nodes, likely achieving theoretical time complexity only a log n factor above the optimal n log n log (log n), for n primes produced. Indeed, empirically it runs at about ~ n1.2 (for producing first few million primes), similarly to priority-queue–based version of M. O'Neill's, and with very low space complexity too (not counting the produced sequence of course):

The above work is derived from the Epilogue of the Melissa E. O'Neill paper which is much referenced with respect to incremental functional sieves; however, that paper is now dated and her comments comparing list based sieves to her original work leading up to a Priority Queue based implementation is no longer current given more recent work such as the above Tree Merging version. Accordingly, a modern "odd's-only" Priority Queue version is developed here for more current comparisons between the above list based incremental sieves and a continuation of O'Neill's work.

In order to implement a Priority Queue version with Haskell, an efficient Priority Queue, which is not part of the standard Haskell libraries is required. A Min Heap implementation is likely best suited for this task in providing the most efficient frequently used peeks of the next item in the queue and replacement of the first item in the queue (not using a "pop" followed by a "push) with "pop" operations then not used at all, and "push" operations used relatively infrequently. Judging by O'Neill's use of an efficient "deleteMinAndInsert" operation which she states "(We provide deleteMinAndInsert becausea heap-based implementation can support this operation with considerably less rearrangement than a deleteMin followed by an insert.)", which statement is true for a Min Heap Priority Queue and not others, and her reference to a priority queue by (Paulson, 1996), the queue she used is likely the one as provided as a simple true functional Min Heap implementation on RosettaCode, from which the essential functions are reproduced here:

The "peekMin" function retrieves both of the key and value in a tuple so processing is required to access whichever is required for further processing. As well, the output of the peekMin function is a Maybe with the case of an empty queue providing a Nothing output.

The following code is O'Neill's original odds-only code (without wheel factorization) from her paper slightly adjusted as per the requirements of this Min Heap implementation as laid out above; note the `seq` adjustments to the "adjust" function to make the evaluation of the entry tuple more strict for better efficiency:

-- (c) 2006-2007 Melissa O'Neill. Code may be used freely so long as-- this copyright message is retained and changed versions of the file-- are clearly marked.-- the only changes are the names of the called PQ functions and the-- included processing for the result of the peek function being a maybe tuple.

The above code is almost four times slower than the version of the Tree Merging sieve above for the first million primes although it is about the same speed as the original Richard Bird sieve with the "odds-only" adaptation as above. It is slow and uses a huge amount of memory for primarily one reason: over eagerness in adding prime composite streams to the queue, which are added as the primes are listed rather than when they are required as the output primes stream reaches the square of a given base prime; this over eagerness also means that the processed numbers must have a large range in order to not overflow when squared (as in the default Integer = infinite precision integers as used here and by O'Neill, but Int64's or Word64's would give a practical range) which processing of wide range numbers adds processing and memory requirement overhead. Although O'Neill's code is elegant, it also loses some efficiency due to the extensive use of lazy list processing, not all of which is required even for a wheel factorization implementation.

The following code is adjusted to reduce the amount of lazy list processing and to add a secondary base primes stream (or a succession of streams when the combinator is used) so as to overcome the above problems and reduce memory consumption to only that required for the primes below the square root of the currently sieved number; using this means that 32-bit Int's are sufficient for a reasonable range and memory requirements become relatively negligible:

The above code is over five times faster than the previous (O'Neill) Priority Queue code and about half again faster than the Tree Merging code for a range of a million primes, and will always be faster as the Min Heap is slightly more efficient than Tree Merging due to better tree balancing.

All of these codes including the list based ones would enjoy about the same constant factor improvement of up to about four times the speed with the application of maximum wheel factorization.

All of the above unbounded sieves are quite limited in practical sieving range due to the large constant factor overheads in computation, making them mostly just interesting intellectual exercises other than for small ranges of about the first million to ten million primes; the following "odds-only page-segmented version using (bit-packed internally) mutable unboxed arrays is about 50 times faster than the fastest of the above algorithms for ranges of about that and higher, making it practical for the first several hundred million primes:

The above code is currently implemented to use "Int" as the prime type but one can change the "PrimeType" to "Int64" (importing Data.Int) or "Word64" (importing Data.Word) to extend the range to its maximum practical range of above 10^14 or so. Note that for larger ranges that one will want to set the "szPGBTS" to something close to the CPU L2 or even L3 cache size (up to 8 Megabytes = 2^23 for an Intel i7) for a slight cost in speed (about a factor of 1.5) but so that it still computes fairly efficiently as to memory access up to those large ranges. It would be quite easy to modify the above code to make the page array size automatically increase in size with increasing range.

The above code takes only a few tens of milliseconds to compute the first million primes and a few seconds to calculate the first 50 million primes, with over half of those times expended in just enumerating the result lazy list, with even worse times when using 64-bit list processing (especially with 32-bit versions of GHC). A further improvement to reduce the computational cost of repeated list processing across the base pages for every page segment would be to store the required base primes (or base prime gaps) in an array that gets extended in size by factors of two (by copying the old array to the new extended array) as the number of base primes increases; in that way the scans across base primes per page segment would just be array accesses which are much faster than list enumeration.

Unlike many other other unbounded examples, this algorithm has the true Sieve of Eratosthenes computational time complexity of O(n log log n) where n is the sieving range with no extra "log n" factor while having a very low computational time cost per composite number cull of less than ten CPU clock cycles per cull (well under as in under 4 clock cycles for the Intel i7 using a page buffer size of the CPU L1 cache size).

There are other ways to make the algorithm faster including high degrees of wheel factorization, which can reduce the number of composite culling operations by a factor of about four for practical ranges, and multi-processing which can reduce the computation time proportionally to the number of available independent CPU cores, but there is little point to these optimizations as long as the lazy list enumeration is the bottleneck as it is starting to be in the above code. To take advantage of those optimizations, functions need to be provided that can compute the desired results without using list processing.

For ranges above about 10^14 where culling spans begin to exceed even an expanded size page array, other techniques need to be adapted such as the use of a "bucket sieve" which tracks the next page that larger base prime culling sequences will "hit" to avoid redundant (and time expensive) start address calculations for base primes that don't "hit" the current page.

However, even with the above code and its limitations for large sieving ranges, the speeds will never come close to as slow as the other "incremental" sieve algorithms, as the time will never exceed about 100 CPU clock cycles per composite number cull, where the fastest of those other algorithms takes many hundreds of CPU clock cycles per cull.

And we might want to express this sentence as a definition of a word that lets us use it with an arbitrary argument. There are a variety of ways of doing this. For example:

sieve0=: verb def 'I.2=+/0=|/~ i.y'

That said, this fails with an argument of 2 (instead of giving an empty list of the primes smaller than 2, it gives a list with one element: 0). Working through why this is and why this matters can be an informative exercise. But, assuming this matters, we need to add some guard logic to prevent that problem:

sieve0a=: verb def 'I.(y>2)*2=+/0=|/~ i.y'

Of course, we can also express this in an even more elaborate fashion. The elaboration makes more efficient use of resources for large arguments, at the expense of less efficient use of resources for small arguments:

The use of 2 3 5 7 as wheels provides a
20% time improvement for n=1000 and 2% for n=1e6 but note that sieve2 is still 25 times slower than i.&.(p:inv) for n=1e6. Then again, the value of the sieve of eratosthenes was not efficiency but simplicity. So perhaps we should ignore resource consumption issues and instead focus on intermediate results for reasonably sized example problems?

privateBigInteger i = BigInteger.valueOf(2);// priority queue of the sequences of non-primes// the priority queue allows us to get the "next" non-prime quicklyfinalPriorityQueue<NonPrimeSequence> nonprimes = newPriorityQueue<NonPrimeSequence>();

@Overridepublicboolean hasNext(){returntrue;} @OverridepublicBigInteger next(){// skip non-prime numbersfor(;!nonprimes.isEmpty()&& i.equals(nonprimes.peek().currentMultiple); i = i.add(BigInteger.ONE)){// for each sequence that generates this number,// have it go to the next number (simply add the prime)// and re-position it in the priority queuewhile(nonprimes.peek().currentMultiple.equals(i)){ NonPrimeSequence x = nonprimes.poll(); x.currentMultiple = x.currentMultiple.add(x.prime); nonprimes.offer(x);}}// prime// insert a NonPrimeSequence object into the priority queue nonprimes.offer(new NonPrimeSequence(i));BigInteger result = i; i = i.add(BigInteger.ONE);return result;}

The adding of each discovered prime's incremental step information to the mapping should be postponed until the candidate number reaches the primes square, as it is useless before that point. This drastically reduces the space complexity from O(n/log(n)) to O(sqrt(n/log(n))), in n primes produced, and also lowers the run time complexity due to the use of the hash table based HashMap, which is much more efficient for large ranges.

Although somewhat faster than the previous infinite iterator version, the above code is still over 10 times slower than an infinite iterator based on array paged segmentation as in the following code, where the time to enumerate/iterate over the found primes (common to all the iterators) is now about half of the total execution time:

While the above code is quite fast especially using an efficient JavaScript engine such as Google Chrome's V8, it isn't as elegant as it could be using the features of the new EcmaScript6 specification when it comes out about the end of 2014 and when JavaScript engines including those of browsers implement that standard in that we might choose to implement an incremental algorithm iterators or generators similar to as implemented in Python or F# (yield). Meanwhile, we can emulate some of those features by using a simulation of an iterator class (which is easier than using a call-back function) for an "infinite" generator based on an Object dictionary as in the following odds-only code written as a JavaScript class:

var SoEIncClass =(function(){function SoEIncClass(){this.n=0;} SoEIncClass.prototype.next=function(){this.n+=2;if(this.n<7){// initialization of sequence to avoid runaway:if(this.n<3){// only even of two:this.n=1;// odds from here...return2;}if(this.n<5)return3;this.dict={};// n must be 5...this.bps=new SoEIncClass();// new source of base primesthis.bps.next();// advance past the even prime of two...this.p=this.bps.next();// first odd prime (3 in this case)this.q=this.p*this.p;// set guardreturn5;}else{// past initialization:var s =this.dict[this.n];// may or may not be defined...if(!s){// not defined:if(this.n<this.q)// haven't reached the guard:returnthis.n;// found a primeelse{// n === q => not prime but at guard, so:var p2 =this.p<<1;// the span odds-only is twice primethis.dict[this.n+ p2]= p2;// add next composite of prime to dictthis.p=this.bps.next();this.q=this.p*this.p;// get next base prime guardreturnthis.next();// not prime so advance...}}else{// is a found composite of previous base prime => not primedeletethis.dict[this.n];// advance to next composite of this prime:var nxt =this.n+ s; while (this.dict[nxt]) nxt += s;// find unique empty slot in dictthis.dict[nxt]= s;// to put the next composite for this base primereturnthis.next();// not prime so advance...}}};return SoEIncClass;})();

The above code can be used to find the nth prime (which would require estimating the required range limit using the previous fixed range code) by using the following code:

to produce the following output (about five seconds using Google Chrome's V8 JavaScript engine):

15485863

The above code is considerably slower than the fixed range code due to the multiple method calls and the use of an object as a dictionary, which (using a hash table as its basis for most implementations) will have about a constant O(1) amortized time per operation but has quite a high constant overhead to convert the numeric indices to strings which are then hashed to be used as table keys for the look-up operations as compared to doing this more directly in implementations such as the Python dict with Python's built-in hashing functions for every supported type.

This can be implemented as an "infinite" odds-only generator using page segmentation for a considerable speed-up with the alternate JavaScript class code as follows:

The above code is about fifty times faster (about five seconds to calculate 50 million primes to about a billion on the Google Chrome V8 JavaScript engine) than the above dictionary based code.

The speed for both of these "infinite" solutions will also respond to further wheel factorization techniques, especially for the dictionary based version where any added overhead to deal with the factorization wheel will be negligible compared to the dictionary overhead. The dictionary version would likely speed up about a factor of three or a little more with maximum wheel factorization applied; the page segmented version probably won't gain more than a factor of two and perhaps less due to the overheads of array look-up operations.

At about 35 seconds for a range of a billion on my Intel Atom i5-Z8350 CPU at 1.92 GHz (single threaded) or about 70 CPU clock cycles per culling operation, the above examples are two of the very slowest ways to compute the Sieve of Eratosthenes over any kind of a reasonable range due to a couple of factors:

The output primes are extracted to a result array which takes time (and memory) to construct.

They use the naive "one huge memory array" method, which has poor memory access speed for larger ranges.

Even though the first uses an odds-only algorithm (not noted in the text as is a requirement of the task) that reduces the number of operations by a factor of about two and a half times, it is not faster than the second, which is not odds-only due to the second being set up to take advantage of the `findall` function to directly output the indices of the remaining true values as the found primes; the second is faster due to the first taking longer to push the found primes singly to the constructed array, whereas internally the second first creates the array to the size of the counted true values and then just fills it.

Also, the first uses more memory than necessary in one byte per `Bool` where using a `BitArray` as in the second reduces this by a factor of eight.

If one is going to "crib" the MatLab algorithm as above, one may as well do it using odds-only as per the MatLab built-in. The following alternate code improves on the "Alternate" example above by making it sieve odds-only and adjusting the result array contents after to suit:

This takes less about 18.5 seconds or 36 CPU cycles per culling operation to find the primes to a billion, but that is still quite slow compared to what can be done below. Note that the result array needs to be created then copied, created by the findall function, then modified in place by the map! function to transform the indices to primes, and finally copied by the pushfirst! function to add the only even prime of two to the beginning, but these operations are quire fast. However, this still consumes a lot of memory, as in about 64 Megabytes for the sieve buffer and over 400 Megabytes for the result (8-byte Int's for 64 bit execution) to sieve to a billion, and culling the huge culling buffer that doesn't fit the CPU cache sizes is what makes it slow.

The creation of an output results array is not necessary if the purpose is just to scan across the resulting primes once, they can be output using an iterator (from a `BitArray`) as in the following odds-only code:

This reduces the CPU cycles per culling cycles to about 24.4, but it's still slow due to using the one largish array. Note that counting each iterated prime takes an additional about one and a half seconds, where if all that is required is the count of primes over a range the specialized length function is much faster.

For any kind of reasonably large range such as a billion, a page segmented version should be used with the pages sized to the CPU caches for much better memory access times. As well, the following odds-only example uses a custom bit packing algorithm for a further two times speed-up, also reducing the memory allocation delays by reusing the sieve buffers when possible (usually possible):

# count the number of zero bits (primes) in a byte array,# also works for part arrays/slices, best used as an `@view`...function countComposites(cmpsts::AbstractArray{UInt8,1}) foldl((a, b) -> a + count_zeros(b), cmpsts; init = 0)end

Note that "the slow way" as commented out in the code takes an extra about 4.85 seconds to count the primes to a billion, or longer to enumerate the primes than to cull the composites; this makes further work in making this yet faster pointless unless techniques such as the one used here to count the number of found primes by just counting the un-cancelled bit representations in the sieved sieve buffers are used.

This takes about 1.9 seconds to count the primes to a billion (using the fast technique), or about 3.75 clock cycles per culling operation, which is reasonably fast; this is almost 20 times faster the the first naive sieves. As written, the algorithm maintains its efficiency up to about 16 billion and then slows down as the buffer size increases beyond the CPU L1 cache size into the L2 cache size such that it takes about 436.8 seconds to sieve to 100 billion instead of the expected about 300 seconds; however, an extra feature of "double buffered sieving" could be added so that the buffer is sieved in L1 cache slices followed by a final sweep of the entire buffer by the few remaining cull operations that use the larger primes for only a slight reduction in average cycles per cull up to a range of about 2.56e14 (for this CPU). For really large ranges above that, another sieving technique known as the "bucket sieve" that sorts the culling operations by page so that processing time is not expended for values that don't "hit" a given page can be used for only a slight additional reduction in efficiency.

Additionally, maximal wheel factorization can reduce the time by about a factor of four, plus multi-processing where the work is shared across the CPU cores can produce a further speed-up by the factor of the number of cores (only three times on this four-core machine due to the clock speed reducing to 75% of the rate when all cores are used), for an additional about 12 times speed-up for this CPU. These improvements are just slightly too complex to post here.

However, even the version posted shows that the naive "one huge array" implementations should never be used for sieving ranges of over a few million, and that Julia can come very close to the speed of the fastest languages such as C/C++ for the same algorithm.

One of the best simple purely functional Sieve of Eratosthenes algorithms is the infinite tree folding sequence algorithm as implemented in Haskell. As Julia does not have a standard LazyList implementation or library and as a full memoizing lazy list is not required for this algorithm, the following odds-only code implements the rudiments of a Co-Inductive Stream (CIS) in its implementation:

const Thunk = Function # can't define other than as a generalized Function

At about 1.8 seconds or 4000 cycles per culling operation to calculate the number of primes up to a million, this is very slow, but that is not the fault of Julia but rather just that purely functional incremental Sieve of Eratosthenes implementations are much slower than those using mutable arrays and are only useful over quite limited ranges of a few million. For one thing, incremental algorithms have O(n log n log log n) asymptotic execution complexity rather than O(n log log n) (an extra log n factor) and for another the constant execution overhead is much larger in creating (and garbage collecting) elements in the sequences.

The time for this algorithm is quite comparable to as implemented in other functional languages such as F# and actually faster than implementing the same algorithm in C/C++, but slower than as implemented in purely functional languages such as Haskell or even in only partly functional languages such as Kotlin by a factor of ten or more; this is due to those languages having specialized memory allocation that is very fast at allocating small amounts of memory per allocation as is often a requirement of functional programming. The majority of the time spent for this algorithm is spent allocating memory, and if future versions of Julia are to be of better use in purely functional programming, improvements need to be made to the memory allocation.

To gain some extra speed above the purely functional algorithm above, the Python'ish version as a mutable iterator embedding a mutable standard base Dictionary can be used. The following version uses a secondary delayed injection stream of "base" primes defined recursively to provide the successions of composite values in the Dictionary to be used for sieving:

The above version can be used and tested with similar code as for the functional version, but is about ten times faster at about 400 CPU clock cycles per culling operation, meaning it has a practical range ten times larger although it still has a O(n (log n) (log log n)) asymptotic performance complexity; for larger ranges such as sieving to a billion or more, this is still over a hundred times slower than the page segmented version using a page segmented sieving array.

The above version is quite slow for a lot of reasons: It includes even number culling even though those will be eliminated on the first pass; It uses a list rather than an array to do the composite culling (both of the above reasons also meaning it takes more memory); It uses enumerations (for..in) to implement loops at a execution time cost per loop. It also consumes more memory in the final result output as another list.

The following code overcomes most of those problems: It only culls odd composites; it culls a bit-packed primitive array (also saving memory); It uses tailcall recursive functions for the loops, which are compiled into simple loops. It also outputs the results as an enumeration, which isn't fast but does not consume any more memory than the culling array. In this way, the program is only limited in sieving range by the maximum size limit of the culling array, although as it grows larger than the CPU cache sizes, it loses greatly in speed; however, that doesn't matter so much if just enumerating the results.

Ah, one might say, for such a trivial range one writes for conciseness and not for speed. Well, I say, one can still save memory and some time using odds-only and a bit-packed array, but write very clear and concise (but slower) code using nothing but higher order functions and function calling. The following code using such techniques can use the same "main" function for the same output but is about two times slower, mostly due to the extra time spent making (nested) function calls, including the function calls necessary for enumeration. Note that the effect of using the "(l .. h).forEach { .. }" is the same as the "for i in l .. h { .. }" as both use an iteration across the range but the second is just syntax sugar to make it look more imperative:

The trouble with the above version is that, at least for Kotlin version 1.0, the ".filter" and ".map" extension functions for Iterable<Int> create Java "ArrayList"'s as their output (which are wrapped to return the Kotlin "List<Int>" interface), thus take a considerable amount of memory worse than the first version (using an ArrayList to store the resulting primes), since as the calculations are chained to ".map", require a second ArrayList of up to the same size while the mapping is being done. The following version uses Sequences , which aren't backed by any permanent structure, but it is another small factor slower due to the nested function calls:

The following Sieve of Eratosthenes is not purely functional in that it uses a Mutable HashMap to store the state of succeeding composite numbers to be skipped over, but embodies the principles of an incremental implementation of the Sieve of Eratosthenes sieving odds-only and is faster than most incremental sieves due to using mutability. As with the fastest of this kind of sieve, it uses a delayed secondary primes feed as a source of base primes to generate the composite number progressions. The code as follows:

At about 370 clock cycles per culling operation (about 3,800 cycles per prime) on my tablet class Intel CPU, this is not blazing fast but adequate for ranges of a few millions to a hundred million and thus fine for doing things like solving Euler problems. For instance, Euler Problem 10 of summing the primes to two million can be done with the following "one-liner":

primesHM().takeWhile { it <= 2_000_000 }.map { it.toLong() }.sum()

to output the correct answer of the following in about 270 milliseconds for my Intel x5-Z8350 at 1.92 Gigahertz:

Output:

142913828922

An unbounded Page Segmented Sieve of Eratosthenes that can output a sequence (iterator)

The very fastest implementations of a primes sieve are all based on bit-packed mutable arrays which can be made unbounded by setting them up so that they are a succession of sieved bit-packed arrays that have been culled of composites. The following code is an odds=only implementation that, again, uses a secondary feed of base primes that is only expanded as necessary (in this case memoized by a rudimentary lazy list structure to avoid recalculation for every base primes sweep per page segment):

// sequence over primes from above page iterator;// unless doing something special with individual primes, usually unnecessary;// better to do manipulations based on the composites bit arrays...// takes at least as long to enumerate the primes as sieve them...fun primesPaged(): Sequence<Prime> = sequence { yield(2L) for ((low,cmpsts) in makeSievePages()) { val szbts = cmpsts.size.shl(3) for (i in 0 until szbts) { if (cmpsts[i.shr(3)].toInt() and 1.shl(i and 7) != 0) continue yield(low + i.shl(1).toLong()) } }}

For this implementation, counting the primes to a million is trivial at about 15 milliseconds on the same CPU as above, or almost too short to count.

It shows its speed in solving the Euler Problem 10 above about five times faster at about 50 milliseconds to give the same output:

It can sum the primes to 200 million or a hundred times the range in just over three seconds.

It finds the count of primes to a billion in about 16 seconds or just about 1000 times slower than to sum the primes to a range 1000 times less for an almost linear response to range as it should be.

However, much of the time (about two thirds) is spent iterating over the results rather than doing the actual work of sieving; for this sort of problem such as counting, finding the nth prime, finding occurrences of maximum prime gaps, etc., one should really use specialized function that directly manipulate the output sieve arrays. Such a function is provided by the `countPrimeTo` function, which can count the primes to a billion (50847534) in about 5.65 seconds, or about 10.6 clock cycles per culling operation or about 210 cycles per prime.

Kotlin isn't really fast even as compared to other virtual machine languages such as C# and F# on CLI but that is mostly due to limitations of the Java Virtual Machine (JVM) as to speed of generated Just In Time (JIT) compilation, handling of primitive number operations, enforced array bounds checks, etc. It will always be much slower than native code producing compilers and the (experimental) native compiler for Kotlin still isn't up to speed (pun intended), producing code that is many times slower than the code run on the JVM (December 2018).

function sieveE int set itemdel to comma local sieve repeat with i = 2 to int put i into sieve[i] end repeat put 2 into n repeat while n < int repeat with p = n to int step n if p = n then next repeat else put empty into sieve[p] end if end repeat add 1 to n end repeat combine sieve with comma filter items of sieve without empty sort items of sieve ascending numeric return sieveend sieveE