Small (up to 2^64) prime numbers in C++

This is a small package for working with small prime numbers (up to 2^64,
not enough for cryptography). Everything is organized as
static members of the class z64 (the class has no non-static members).

Only odd numbers are stored in the sieve, one bit each.
Therefore, N bytes are sufficient for the prime numbers up to 16*N+1.
Storing all primes up to 1 million requires about 63K bytes.
1G bytes is enough for the primes up to 16 billion.
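
The storage scheme above can be sketched as follows. This is only an illustration, not the layout used inside z64: the name OddSieve and its members are hypothetical, and bit i is assumed to stand for the odd number 2*i+1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical odd-only bit sieve (not the actual z64 layout).
// Bit i represents the odd number 2*i + 1; a set bit means "not prime".
struct OddSieve {
    std::vector<std::uint8_t> bits;  // one bit per odd number
    std::uint64_t limit;             // largest number covered

    explicit OddSieve(std::uint64_t n) : bits(n / 16 + 1, 0), limit(n) {
        bits[0] |= 1;  // the number 1 is not prime
        for (std::uint64_t p = 3; p * p <= n; p += 2) {
            if (isMarked(p)) continue;
            // Even multiples are never stored, so step by 2*p.
            for (std::uint64_t m = p * p; m <= n; m += 2 * p) mark(m);
        }
    }
    bool isMarked(std::uint64_t odd) const {
        std::uint64_t i = odd >> 1;
        return (bits[i >> 3] >> (i & 7)) & 1;
    }
    void mark(std::uint64_t odd) {
        std::uint64_t i = odd >> 1;
        bits[i >> 3] |= static_cast<std::uint8_t>(1u << (i & 7));
    }
    bool isPrime(std::uint64_t x) const {
        assert(x <= limit);
        if (x == 2) return true;
        if (x < 2 || x % 2 == 0) return false;
        return !isMarked(x);
    }
};
```

With n = 1000000 the vector holds 62501 bytes, which matches the 63K estimate.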

Using the sieve (up to maxP()), the functions isPrime, NextPrime and PrevPrime are easy to implement.
But the functions p() (ithprime) and Pi() need some additional data.
While the sieve is being built, the values of Pi(n)
(the number of primes not greater than n) are stored in a separate array for all multiples of 1024.
So, for a sieve up to N we need N/16 bytes for the sieve itself and N/256 bytes for this additional table.
Using this table, the functions p() and Pi() are fast.
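
A minimal sketch of the checkpoint idea, under stated assumptions: PiTable and piAt are hypothetical names, the sieve is a std::vector<bool> for brevity rather than packed bytes, and table entries are 32-bit (which agrees with N/256 bytes for N/1024 entries). Pi(n) starts from the nearest checkpoint below n and scans at most 1023 further numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sketch: Pi(n) from checkpoints stored at multiples of 1024.
struct PiTable {
    std::vector<bool> composite;      // composite[i] refers to the odd number 2*i+1
    std::vector<std::uint32_t> piAt;  // piAt[k] = Pi(1024*k)
    std::uint64_t limit;

    explicit PiTable(std::uint64_t n)
        : composite(n / 2 + 1, false), piAt(n / 1024 + 1, 0), limit(n) {
        composite[0] = true;  // 1 is not prime
        for (std::uint64_t p = 3; p * p <= n; p += 2)
            if (!composite[p / 2])
                for (std::uint64_t m = p * p; m <= n; m += 2 * p)
                    composite[m / 2] = true;
        // Record the running prime count at every multiple of 1024.
        std::uint32_t count = 0;
        for (std::uint64_t x = 1; x <= n; ++x) {
            if (isPrime(x)) ++count;
            if (x % 1024 == 0) piAt[x / 1024] = count;
        }
    }
    bool isPrime(std::uint64_t x) const {
        if (x == 2) return true;
        if (x < 2 || x % 2 == 0) return false;
        return !composite[x / 2];
    }
    std::uint64_t Pi(std::uint64_t n) const {
        assert(n <= limit);
        std::uint64_t k = n / 1024;
        std::uint64_t count = piAt[k];
        for (std::uint64_t x = k * 1024 + 1; x <= n; ++x)
            if (isPrime(x)) ++count;
        return count;
    }
};
```

The i-th prime can be found from the same table by a binary search over piAt followed by a short scan of the sieve.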

For numbers up to z64::maxN() we use the sieve. If n < 2^32 and n is not in the sieve, we run the
Miller-Rabin primality test with bases
2 and 3, and then check n against the list of all 103 composite numbers below
2^32 on which this test gives the wrong answer (unsigned BCompos[103]). So, for numbers below 2^32 this function is exact.
For larger numbers we use 12 Miller-Rabin tests with the prime bases 2, 3, ..., 37.
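
The 12-base test can be sketched as below. This is an illustration and not the package's code: isPrimeMR, strongProbablePrime, mulMod and powMod are hypothetical names, and the modular multiplication relies on the GCC/Clang extension unsigned __int128 instead of assembler. The table of 103 exceptions for bases 2 and 3 is not reproduced here. The first 12 primes as bases are known to give no false positives for any 64-bit input.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Assumption: the compiler provides unsigned __int128 (GCC, Clang).
static u64 mulMod(u64 a, u64 b, u64 m) {
    return static_cast<u64>(static_cast<unsigned __int128>(a) * b % m);
}

static u64 powMod(u64 a, u64 e, u64 m) {
    u64 r = 1 % m;
    a %= m;
    while (e) {
        if (e & 1) r = mulMod(r, a, m);
        a = mulMod(a, a, m);
        e >>= 1;
    }
    return r;
}

// Strong probable-prime test of an odd n > 2 to base a.
static bool strongProbablePrime(u64 n, u64 a) {
    assert(n > 2 && (n & 1));
    a %= n;
    if (a == 0) return true;
    u64 d = n - 1;
    int s = 0;
    while ((d & 1) == 0) { d >>= 1; ++s; }  // n - 1 = d * 2^s, d odd
    u64 x = powMod(a, d, n);
    if (x == 1 || x == n - 1) return true;
    for (int i = 1; i < s; ++i) {
        x = mulMod(x, x, n);
        if (x == n - 1) return true;
    }
    return false;
}

bool isPrimeMR(u64 n) {
    static const u64 bases[12] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37};
    if (n < 2) return false;
    for (u64 p : bases) {        // small primes and their multiples
        if (n == p) return true;
        if (n % p == 0) return false;
    }
    for (u64 a : bases)
        if (!strongProbablePrime(n, a)) return false;
    return true;
}
```

For example, 1373653 = 829 * 1657 passes the test for bases 2 and 3 but is rejected by base 5.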

mul_mod: this is the key function of the package. Its assembler implementation speeds up the work
by more than an order of magnitude. The assembler implementation has been verified only for gcc-64/Linux.
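
To show what is being accelerated, here is a sketch of two portable ways to compute a*b mod m for 64-bit operands. Neither is the package's assembler code, and the names mulModSlow and mulModFast are hypothetical. The first uses only 64-bit arithmetic (binary double-and-add, up to 64 iterations); the second assumes the GCC/Clang extension unsigned __int128.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// Baseline: binary "double and add", never overflows 64 bits.
u64 mulModSlow(u64 a, u64 b, u64 m) {
    assert(m != 0);
    a %= m;
    b %= m;
    u64 r = 0;
    while (b) {
        if (b & 1)
            r = (r >= m - a) ? r - (m - a) : r + a;  // r = (r + a) mod m
        a = (a >= m - a) ? a - (m - a) : a + a;      // a = (2 * a) mod m
        b >>= 1;
    }
    return r;
}

// 128-bit intermediate product; assumes unsigned __int128 is available.
u64 mulModFast(u64 a, u64 b, u64 m) {
    assert(m != 0);
    return static_cast<u64>(static_cast<unsigned __int128>(a) * b % m);
}
```

Both functions return the same values; only the cost per call differs.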

After some research and a large number of experiments, the most effective
of the elementary factorization methods in our range (up to 2^64) turned out to be
Pollard's rho algorithm;
the rest are not worth implementing.
If all prime factors of a number are large (up to several billion),
the method may take 200-300 thousand steps. But the other methods are slower still.
On the other hand, the prime decomposition of p-1 for a prime p, for example, is found easily.
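
A minimal sketch of Pollard's rho with Floyd cycle detection, assuming the input is composite. The package's own variant may differ; findFactor, rhoStep, gcd64 and mulMod are hypothetical names, and unsigned __int128 is a GCC/Clang extension.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

static u64 mulMod(u64 a, u64 b, u64 m) {
    return static_cast<u64>(static_cast<unsigned __int128>(a) * b % m);
}

static u64 gcd64(u64 a, u64 b) {
    while (b) { u64 t = a % b; a = b; b = t; }
    return a;
}

// One run with the polynomial x -> x*x + c (mod n), 0 < c < n.
// Returns a nontrivial factor of n, or 0 if this choice of c failed.
static u64 rhoStep(u64 n, u64 c) {
    auto f = [n, c](u64 v) {
        u64 t = mulMod(v, v, n);
        return (t >= n - c) ? t - (n - c) : t + c;  // (t + c) mod n, no overflow
    };
    u64 x = 2, y = 2, d = 1;
    while (d == 1) {
        x = f(x);      // tortoise: one step
        y = f(f(y));   // hare: two steps
        d = gcd64(x > y ? x - y : y - x, n);
    }
    return d == n ? 0 : d;
}

// Precondition: n is composite. For a prime n the loop over c would not
// find anything, so primality should be tested first.
u64 findFactor(u64 n) {
    assert(n > 3);
    if (n % 2 == 0) return 2;
    for (u64 c = 1; c < n; ++c) {
        u64 d = rhoStep(n, c);
        if (d) return d;
    }
    return 0;
}
```

A full factorization would call findFactor recursively on both cofactors, with a primality test to stop the recursion.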

In addition, the method does not work on squares of primes, so they are handled separately:
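
One possible way to detect a perfect square before running the factorization, given as a sketch only; the check actually used in the package is not shown here, and isqrt and squareRootIfSquare are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

using u64 = std::uint64_t;

// floor(sqrt(n)) for any 64-bit n: a floating-point estimate,
// corrected by integer steps so that rounding errors do not matter.
u64 isqrt(u64 n) {
    u64 r = static_cast<u64>(std::sqrt(static_cast<long double>(n)));
    while (r > 0 && r > n / r) --r;      // undo an overestimate (r*r > n)
    while (r + 1 <= n / (r + 1)) ++r;    // undo an underestimate ((r+1)^2 <= n)
    assert(r == 0 || r <= n / r);
    return r;
}

// Returns r if n == r*r, otherwise 0.
u64 squareRootIfSquare(u64 n) {
    u64 r = isqrt(n);
    return r * r == n ? r : 0;
}
```

If the returned root is itself prime, n is the square of a prime and no further factoring is needed.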

This function always uses the same static memory to build the string. That is, it always returns
the same pointer. The memory does not need to be allocated or released.
Just remember that each call to this function overwrites the same memory.
If you need to convert many numbers to strings, you can use the function

static void toString(char *s, UINT64 n, int w=0); // s
In that case you have to take care of allocating sufficient memory yourself.
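
The difference between the two variants can be sketched as follows. The names toStringStatic and toStringInto are hypothetical, the extra size parameter is an addition for safety, and treating w as a minimum field width is an assumption about the original parameter.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>

// Caller-supplied buffer. A 64-bit number needs at most 20 digits plus the
// terminating NUL, so 21 bytes are enough when w <= 20.
// Assumption: w is a minimum field width (right-aligned, space-padded).
void toStringInto(char* s, std::size_t size, std::uint64_t n, int w = 0) {
    assert(s != nullptr && size > 0);
    std::snprintf(s, size, "%*" PRIu64, w, n);
}

// Static-buffer variant: convenient, but every call overwrites the result
// of the previous one, and it is not safe to use from several threads.
const char* toStringStatic(std::uint64_t n, int w = 0) {
    static char buf[32];
    toStringInto(buf, sizeof buf, n, w);
    return buf;
}
```

Two results of toStringStatic cannot be kept at the same time, because both pointers refer to the same buffer; with toStringInto each number gets its own buffer.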

Basically, the module is designed for educational purposes, and speed was not the main goal.
However, for 64-bit numbers these functions are faster than the
GMP library. For example, the computation time of the function pow_mod
on an AMD Athlon II X2 240 processor (2.8 GHz) for 63-bit numbers: