I've been running a search for Mordell curves of rank >=8 for about 12 months and have identified approximately 280,000 curves in our archivable range, amongst many millions that aren't.

For this search I have been utilizing up to 60 3 GHz+ CPUs at any one time. Right now I'm reaching the point of diminishing returns and greatly increased memory requirements. As such, I need to be a little smarter in some of the maths.

My question, then, is: what is believed (or known) to be the fastest factoring algorithm for positive integers $< 2^{60}$?

I'm not overly keen to consume further multiple GHz decades of processing power unless I can get a reasonable return on investment, for which a fast factoring algorithm would certainly help.

Any ideas are more than welcome.

EDIT:

Reading through the responses makes me realise that I should probably add that I want to keep the factoring within 64-bit arithmetic. The Pollard rho algorithm was interesting, but its intermediate products would probably exceed the 64-bit limit during execution.

Another part of the puzzle is that, from the factorisations, I'm storing the differences of divisor pairs for each number factored. This may, potentially, leave me with an array of about 50,000,000 values which subsequently needs to be sorted.

I have been using Pari/GP for this project and, up till now, it's been great. My current problem is mainly with memory usage, with each Pari/GP task taking over 8 GB of memory. This is preventing my 32-core machine from being as efficient as it otherwise could be. Hence my intent to move to 64-bit arithmetic and 'C' code, to hopefully gain efficiencies in both time and space, thus breathing new life into an otherwise stalling project.

Update:

Thank you to all those that responded with so many good suggestions on how to proceed.

Eventually I've decided to use the flint library, as suggested by William Hart, rather than try to re-invent the wheel. The ability of flint to work directly with 64-bit integers gives me a great advantage as regards memory usage and speed when compared to my current setup. In particular I can now run all 32 cores on my main machine and still have memory left over, potentially giving an 8-fold improvement in processing throughput.
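For anyone following the same route, here is a minimal sketch of what a single 64-bit factorisation with flint's ulong_extras module might look like. The n_factor_t field names (num, p, exp) are taken from the flint documentation as I remember it, so treat the details as assumptions to check against your installed flint version:

```c
/* Sketch: factor one 64-bit integer with flint's n_factor().
 * Compile with something like: gcc factor64.c -lflint -lgmp -o factor64
 * The n_factor_t field names below follow the flint docs and should be
 * checked against the installed flint version. */
#include <stdio.h>
#include <flint/ulong_extras.h>

int main(void)
{
    ulong n = UWORD(144115188075855859); /* a sample value just below 2^57 */
    n_factor_t fac;

    n_factor_init(&fac);
    n_factor(&fac, n, 1);                /* 1 = require proved prime factors */

    for (int i = 0; i < fac.num; i++)
        printf("%llu^%llu ", (unsigned long long) fac.p[i],
                             (unsigned long long) fac.exp[i]);
    printf("\n");
    return 0;
}
```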

I don't think you will get a satisfying answer. If you know something about how the integers are distributed, you may find some economies of scale by sieving large intervals or sieving arithmetic progressions. If you are just being thrown integers willy-nilly, trial factorization by doing gcd against primorial fragments may be your best bet for speed improvements. You might gain some ground by trying to factorize products of the integers, but without knowing how they are distributed, it might slow things down. Gerhard "Ask Me About System Design" Paseman, 2012.11.20
– Gerhard Paseman Nov 21 '12 at 4:23
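To make the primorial-gcd suggestion concrete, here is a small sketch (not Gerhard's code; the bound B and all names are illustrative) that peels the small prime content off a 60-bit number with a single big gcd, using GMP for the primorial, which does not fit in 64 bits:

```c
/* Sketch of trial factorisation via gcd with a primorial.
 * Assumes an LP64 platform (unsigned long is 64 bits); compile with -lgmp.
 * The bound B is an arbitrary illustrative choice. */
#include <stdio.h>
#include <stdint.h>
#include <gmp.h>

#define B 100000UL

int main(void)
{
    /* Build the primorial of all primes below B once, up front. */
    mpz_t prim, p, g;
    mpz_init_set_ui(prim, 1);
    mpz_init_set_ui(p, 2);
    mpz_init(g);
    while (mpz_cmp_ui(p, B) < 0) {
        mpz_mul(prim, prim, p);
        mpz_nextprime(p, p);
    }

    uint64_t n = 123456789012345678ULL;  /* a sample value below 2^60 */

    /* gcd(primorial, n) is the product of the distinct primes < B dividing n;
     * it divides n, so it fits back into 64 bits. */
    unsigned long small = mpz_gcd_ui(g, prim, (unsigned long) n);
    printf("product of distinct small prime factors: %lu\n", small);
    /* 'small' is cheap to split by ordinary trial division, and each of its
     * primes can then be divided out of n to full multiplicity. */

    mpz_clear(prim);
    mpz_clear(p);
    mpz_clear(g);
    return 0;
}
```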

There are roughly 51 million primes less than $2^{30}$, so a trial factorization of one number on a laptop should be doable in anywhere from under a minute to under a day. Many numbers (perhaps 65% or more of those with 60 binary digits) have all factors under $2^{40}$, so a lot of the numbers can be completely factored with trial division in under an hour. Knowing this, even to within an order of magnitude, does no good unless we know how many numbers he wants to factor and if they are of some advantageous form or not. Gerhard "Ask Me About Rough Mathematics" Paseman, 2012.11.20
– Gerhard Paseman Nov 21 '12 at 6:27


@Gerhard: I'll need to be able to perform about 4 million such factorisations in a day. The numbers are all of the form $x^3-x'^3$, with $x > 500,000$ and $0 < x' < x$, so currently $2^{57}$ is the worst case.
– Kevin Acres Nov 21 '12 at 7:23


I assume you are not able to interface directly with C code. If you were, you could use the n_factor function in flint (flintlib.org). First use the algebraic factorisation suggested above, then factor the remaining parts with flint. According to timings I have, you should easily hit 20,000,000 factorisations an hour on recent hardware.
– William Hart Nov 21 '12 at 12:46

11 Answers

Shanks's SQUFOF (SQUare FOrm Factorisation) might well be worth a detailed look.

Henri Cohen comments:

This method is very simple to implement and has the big advantage of working exclusively with numbers which are at most $2\sqrt{N}$ [...] Therefore it is eminently fast and practical when one wants to factor numbers less than $10^{19}$, even on a pocket calculator.

And $2^{60}$ is just below $10^{19}$; indeed, the $10^{19}$ should arise as the number such that $2\sqrt{N}$ still fits in an "int".
For reference, the complexity is $O(N^{1/4+\varepsilon})$; but this is not the main point.

Typically, this would have to be preceded by trial division for small factors and (possibly) some primality test.

In general, I recommend the relevant chapters of the book "A Course in Computational Algebraic Number Theory" by Henri Cohen (from which the quote is taken); in addition to theory and algorithms, he typically also discusses practical aspects of implementation (the book is not recent, so some of these might have to be modified, but the practical discussion of ranges is still there).

Also, various other references of SQUFOF are easy to find via a web-search.

On a 32-bit computer SQUFOF is the clear champion factoring algorithm for numbers between $10^{10}$ and $10^{18}$, and will likely remain so.

More recently, William Hart proposed a Fermat variant he called A One Line Factoring Algorithm, which he describes as competitive with SQUFOF in practice (despite its complexity being $O(n^{1/3})$ rather than $O(n^{1/4})$). In the respective paper (J. Aust. Math. Soc. 2012), or to be precise in a preprint thereof, as I do not have the actual paper, he writes

Most modern computer algebra packages implement numerous algorithms for factoring. For numbers that fit into a single machine word, Shanks' SQUFOF is popular as it has run time $O(n^{1/4})$ with a very small implied constant.

So, for the range asked about in the question I am quite confident that SQUFOF would be a good choice. It should however be noted (this is also discussed in Cohen's book) that as the numbers get larger, beyond the mentioned threshold, SQUFOF becomes unattractive while, e.g., Pollard rho stays interesting. The rough reason seems to be that SQUFOF does not profit from 'small' factors, as opposed to, e.g., Pollard rho (cf. Laurent Berger's answer). However, for numbers in that range, and after trial division (Cohen suggests trial division by primes up to 500000 in that case), there are no such 'small' factors left anyway.

As already pointed out, a big plus for SQUFOF is that the numbers involved are only of size $2\sqrt{N}$, in contrast to other methods, which often require working with numbers of size $N$ or even $N^2$. This affects only the constant in the running time, but that too is important, and in practice it allows one to get by with simple data types for longer.

A rough inspection of flint, mentioned by William Hart in a comment, suggests that it does indeed use SQUFOF above $2^{40}$; below that it uses a Fermat variant due to him, combined with trial division and perfect power checks. (I hope I got this right; the main purpose of this is to reassure myself that my answer is not off.)
– quid Nov 21 '12 at 14:15


While SQUFOF is good, I confess that I would have expected a good ECM implementation to become better before the $2^{60}$ cutoff.
– Hurkyl Nov 21 '12 at 19:19


@Hurkyl: I will come back to this in more detail in two hours or so, but I have a 2008 quote handy that says that on 32-bit hardware SQUFOF is the 'clear champion' in the $10^{10}$ to $10^{18}$ range.
– quid Nov 21 '12 at 20:04

@Quid: I've got the book and I'll have a look later. Many thanks.
– Kevin Acres Nov 21 '12 at 20:16


One could perhaps combine trial division with a few iterations of Pollard's $\rho$-method: $N=1000$ iterations (or perhaps $2000$ or $3000$) of $\rho$ should catch most prime factors smaller than $500000$. Fixing a $\rho$-method, one can precompute a (presumably) small list of primes smaller than $500000$ not caught by $N$ iterations of $\rho$. I have however not checked whether this is faster for random elements of size $\sim 2^{60}$ for suitable choices of $N$ and of a $\rho$-method. (A fixed $\rho$-method will of course not always give a complete factorization.)
– Roland Bacher Nov 22 '12 at 13:02

A long, long time ago I programmed my HP48 (4 MHz! Wow!) in assembly language to factor numbers. It would do trial division by all integers congruent to $\pm 1 \bmod{6}$ up to some bound $B$, and then use Pollard's rho method. I guess in your case you need to figure out what $B$ should be depending on your implementation. At the time, I was using "Prime Numbers and Computer Methods for Factorization" by Hans Riesel as a reference. Since $2^{60}$ is relatively small, I'm not sure that more powerful methods would really help. Pollard's rho method is believed to run in something like $O(p^{1/2})$ operations, where $p$ is the smallest prime factor of your integer, which is quite small in your case.
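Regarding the 64-bit overflow worry raised in the question: the only place the rho iteration needs more than 64 bits is the squaring modulo $n$, which can go through a 128-bit intermediate. Here is a minimal sketch (the trial division bound, iteration constant, and sample value are arbitrary illustrative choices, not tuned ones):

```c
/* Sketch: trial division up to a small bound, then Pollard's rho.
 * For n < 2^60 the products x*x can reach ~120 bits, so the modular
 * multiplication goes through unsigned __int128 (GCC/Clang on x86-64). */
#include <stdio.h>
#include <stdint.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)((unsigned __int128) a * b % n);
}

static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* Returns a non-trivial factor of composite n, or n itself on failure
 * (in which case one retries with a different increment c). */
static uint64_t pollard_rho(uint64_t n, uint64_t c)
{
    uint64_t x = 2, y = 2, d = 1;
    while (d == 1) {
        x = (mulmod(x, x, n) + c) % n;      /* tortoise: one step */
        y = (mulmod(y, y, n) + c) % n;      /* hare: two steps    */
        y = (mulmod(y, y, n) + c) % n;
        d = gcd64(x > y ? x - y : y - x, n);
    }
    return d;
}

int main(void)
{
    uint64_t n = 999983ULL * 1000003ULL * 1000033ULL;  /* sample composite below 2^60 */
    uint64_t p;

    /* Strip small factors first; rho is only worthwhile afterwards. */
    for (p = 2; p * p <= n && p < 100000; p += (p == 2) ? 1 : 2)
        while (n % p == 0) { printf("%llu ", (unsigned long long) p); n /= p; }

    if (n == 1) { printf("\n"); return 0; }
    if (p * p > n) {                        /* remaining cofactor is prime */
        printf("%llu\n", (unsigned long long) n);
        return 0;
    }
    /* In real code, run a Miller-Rabin test here: rho called on a prime
     * only terminates once it returns n itself. */
    uint64_t d = pollard_rho(n, 1);         /* retry with c = 2, 3, ... on failure */
    printf("rho split: %llu * %llu\n",
           (unsigned long long) d, (unsigned long long) (n / d));
    return 0;
}
```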

Within the paper (the preprint by Jérôme Milan discussed in the comments below), short pseudocode is given for each method, so you should be able to judge whether they will suit your computational restrictions.

From those experiments, SQUFOF is indeed the fastest up to the $2^{60}$ range, where it is tied with SIQS. Note that these experiments are done on integers known to have few prime factors (something like 2 to 4).

For your case, it appears that you are factoring random integers up to $2^{60}$.
Hence, except in the case where you use ECM, it is more efficient to first trial divide by small primes up to a certain bound.
(See the comments below on the case for ECM.)

One additional point: ECM (specifically GMP-ECM, also from Inria) might be a good idea, since the software is highly optimized (a research team has been working on it for several years).
On the other hand, the arithmetic is done modulo $N$.
This might mean that there will be overflow during the multiplication steps.

If you choose ECM, I can elaborate a little more.
A common problem is the fact that ECM is probabilistic.
This can be resolved if you know which curves to choose to cover all the prime factors.
Inria has also published work in this area, showing that 124 carefully chosen curves find all prime factors up to $2^{32}$.
The numbers are quite suitable for your case.
Statistically speaking though, you find the factors with probability around 0.15 per curve, so usually 20 curves will do the job.
As mentioned earlier, using these parameters you can skip the trial factorization part.
During the search for bigger primes, you will also find the smaller ones in the process.

If you want to push the efficiency even higher, the optimal approach is to first do a filtering pass based on the $p-1$ and $p+1$ factorization methods, followed by the actual ECM.
This is a little tougher since you need to calibrate the parameters correctly.
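To show what such a filtering pass might look like, here is a minimal stage-1 Pollard $p-1$ sketch in C (the $p-1$ half only; the bound B1, the base, and the sample number are illustrative choices, and the 128-bit intermediate again relies on the compiler providing unsigned __int128):

```c
/* Sketch of a Pollard p-1 "stage 1" pass: it pulls out any prime p | n
 * for which p-1 has only small prime-power factors (<= B1).  B1 and the
 * sample value are illustrative, not calibrated. */
#include <stdio.h>
#include <stdint.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)((unsigned __int128) a * b % n);
}

static uint64_t powmod(uint64_t a, uint64_t e, uint64_t n)
{
    uint64_t r = 1 % n;
    while (e) {
        if (e & 1) r = mulmod(r, a, n);
        a = mulmod(a, a, n);
        e >>= 1;
    }
    return r;
}

static uint64_t gcd64(uint64_t a, uint64_t b)
{
    while (b) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

/* Returns a factor of n (possibly the trivial 1 or n). */
static uint64_t pminus1_stage1(uint64_t n, uint64_t B1)
{
    uint64_t a = 2;
    for (uint64_t q = 2; q <= B1; q++) {
        uint64_t qk = q;                 /* largest power of q not exceeding B1; */
        while (qk <= B1 / q) qk *= q;    /* composite q only adds redundant work */
        a = powmod(a, qk, n);
    }
    return gcd64((a + n - 1) % n, n);    /* gcd(a - 1 mod n, n) */
}

int main(void)
{
    /* 9241 is prime and 9240 = 2^3 * 3 * 5 * 7 * 11 is 100-powersmooth,
     * so stage 1 with B1 = 100 should peel 9241 off this product. */
    uint64_t n = 9241ULL * 1000003ULL;
    printf("%llu\n", (unsigned long long) pminus1_stage1(n, 100));
    return 0;
}
```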

EDIT: I did not notice that all numbers are of the form $x^3-y^3$ with $0 < y < x$ and $x$ up to about $500000$.
There are about $500000^2/2 \approx 2^{36.86}$ possibilities.
Out of all these, a portion will be prime.
A portion will be smooth, with all factors $< B$ for some bound $B$, and can easily be trial divided.
The remaining cases are composites whose prime factors are all $> B$.

Suppose you have $n$ remaining entries.
Store them as pairs $(N_i, L_i)$, sorted by $N_i$, where $N_i$ is the entry's value (a composite all of whose prime factors are $> B$) and $L_i$ is the offset at which its factorization is stored.
At offset $L_i$, store the distinct prime factors (you may choose not to store multiplicities, since they can be recovered by trial division).

Each time you get a number:
1) Do primality test
2) Do trial division
3) Do another primality test
4) If a residue remains, look up its factorization by binary search in the sorted table.

I think this should be able to handle your 4 million factorizations.
Clearly there are more ways to tweak the method, but the idea here is that you can do a one-time computation and store the difficult factorizations as a lookup table.
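A minimal C sketch of that lookup step, assuming the sorted table and the flat factor array have already been produced by the one-time precomputation (all names, types, and the tiny hard-coded table below are illustrative):

```c
/* Sketch of the lookup table: a sorted array of (value, offset, count)
 * records, with the prime factors themselves in a separate flat array. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t value;   /* N_i: composite with all prime factors > B       */
    uint32_t offset;  /* L_i: index of its first factor in factor_pool   */
    uint32_t count;   /* number of distinct prime factors                */
} entry_t;

/* Tiny hard-coded example; a real table would hold millions of entries. */
static const uint64_t factor_pool[] = { 1000003, 1000033, 1000037, 1000039 };
static const entry_t table[] = {          /* must be sorted by .value */
    { 1000003ULL * 1000033ULL, 0, 2 },
    { 1000037ULL * 1000039ULL, 2, 2 },
};

static int cmp_entry(const void *a, const void *b)
{
    uint64_t x = ((const entry_t *) a)->value;
    uint64_t y = ((const entry_t *) b)->value;
    return (x > y) - (x < y);
}

/* Return the stored record for n, or NULL if n is not in the table. */
static const entry_t *lookup(uint64_t n)
{
    entry_t key = { n, 0, 0 };
    return bsearch(&key, table, sizeof table / sizeof table[0],
                   sizeof table[0], cmp_entry);
}

int main(void)
{
    const entry_t *e = lookup(1000003ULL * 1000033ULL);
    if (e)
        for (uint32_t i = 0; i < e->count; i++)
            printf("%llu ", (unsigned long long) factor_pool[e->offset + i]);
    printf("\n");
    return 0;
}
```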

Thanks for the interesting paper. I just find it puzzling that you say INRIA has published a paper; this seems like a normal preprint by Jérôme Milan (who happens to be affiliated with INRIA). A quick question: do you suggest running GMP-ECM without any prior trial division for very small factors? This seems surprising to me, but I could well be wrong. If not, what would you suggest as very small? (My point is that I took the 500000 from Cohen, but even if one only went to 10000, one would essentially be in a four-factor situation already.) Also, ...
– quid Nov 22 '12 at 7:15


the advantage of SQUFOF up to 50 bits seems quite drastic (in the graphic it looks like an order of magnitude) and stays very large up to 55 bits; and in reality the OP is, it seems, dealing not with 60-bit numbers but with 57 bits at most, and often even fewer due to the algebraic factorization being available.
– quid Nov 22 '12 at 7:20

@quid I am not too sure how the association works. I mentioned INRIA since the papers are usually hosted on its webpages and the authors appear to be linked to the organization. But I guess you are right; it is probably more correct to cite Jérôme Milan instead. As for trial division, testing beyond $2^{10}$ (not sure what the actual values are) does not seem very effective. It is still good to trial divide by the very small primes, though, since the time taken is negligible; it is the medium-sized ones that should not be tested by trial division.
– Ng Yong Hao Nov 22 '12 at 12:37

@quid Roughly speaking, suppose I aim to trial divide over a range of primes. Then, equivalently, I can choose a set of 20 curves to find factors in that same range (20 curves to get a high probability of $>0.95$). So whenever the 20 curves take less time than the trial division, switching to ECM is faster. Unfortunately I did not find that break-even point. Also, I agree that given the OP's range, after trial dividing out the small factors the situation is usually the same as in the paper (2 to 4 factors), so SQUFOF seems like the better choice. My suggestions/comments are for the case when he wants to do it via ECM.
– Ng Yong Hao Nov 22 '12 at 12:57


@quid I have some figures: it takes 0.025 sec on my x64 2.00 GHz laptop to run 20 curves that find prime factors $<2^{32}$, but it takes negligible time to trial divide up to 500000. I can probably increase the limit a little more, up to the point where the run time becomes significant. But the expected gain is small, so I think there is probably not much difference anyway. So upon reflection: 500000 as a limit seems like a decent idea. I think the ideal number also depends on the range ECM handles. In this case the small primes will be found by the curves, so the limit does not need to be that high.
– Ng Yong Hao Nov 22 '12 at 15:18

The second step is to factor the elements in $S$. There can potentially be primes of up to $2^{59}$ in $S$, but we only need to worry about primes up to $2^{30}$. Elliptic curve factorization is probably the method of choice.

In case of relevance: I am assuming that you are sifting through your 280,000 curves and throwing out those that are not suitable. The question is, do you need a complete factorization of your numbers to rule out a curve? For example, if I want to discard any number that is not the sum of two squares, I can very rapidly get rid of a high percentage by trial division by primes up to 1000. That is, if there is a prime $q < 1000$ with $q \equiv 3 \pmod 4$ and my test subject number $n \equiv 0 \pmod q$ but $n \not\equiv 0 \pmod {q^2}$, then I know that $n$ is not the sum of two squares. And, for such numbers, I have not needed to perform a full factorization. In fact, any occurrence of an odd power of such $q$ will decide the question.

Well, that has generally been how I have gotten things done: trial division up to a small bound for initial screening, up to a larger bound for those that remain, and finally an appeal to elliptic curves for the stubborn ones.

Let's see, $2^{60} \approx 10^{18}.$ That is really not so bad. First screening by primes up to $1024 = 2^{10}.$ Second screening by primes up to $1,048,576 = 2^{20}.$ If any number is not completely factored by now, the remaining cofactor is either prime or the product of exactly two primes, since three factors each exceeding $2^{20}$ would already exceed $2^{60}$.
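As a concrete form of the sum-of-two-squares screen described above (a sketch only; the bound and the naive on-the-fly prime generation are illustrative, and a real implementation would precompute the primes $q \equiv 3 \pmod 4$ below the bound):

```c
/* Sketch of the quick discard test: n cannot be a sum of two squares if
 * some prime q = 3 (mod 4) divides n to an odd power.  Returns 1 if n is
 * ruled out by primes q < bound, 0 if this screen is inconclusive. */
#include <stdio.h>
#include <stdint.h>

static int ruled_out_sum_of_two_squares(uint64_t n, uint32_t bound)
{
    for (uint32_t q = 3; q < bound; q += 4) {   /* q = 3, 7, 11, 15, ... */
        /* skip the composites in the 3 (mod 4) progression */
        int is_prime = 1;
        for (uint32_t d = 3; d * d <= q; d += 2)
            if (q % d == 0) { is_prime = 0; break; }
        if (!is_prime)
            continue;

        unsigned exp = 0;
        uint64_t m = n;
        while (m % q == 0) { m /= q; exp++; }
        if (exp & 1)
            return 1;   /* odd power of a 3 (mod 4) prime: not a sum of two squares */
    }
    return 0;           /* undecided by this screen */
}

int main(void)
{
    /* 21 = 3 * 7 is ruled out immediately; 25 = 5^2 is not. */
    printf("%d %d\n", ruled_out_sum_of_two_squares(21, 1000),
                      ruled_out_sum_of_two_squares(25, 1000));
    return 0;
}
```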

@Will. The 280,000 are what are left after identifying a curve, ensuring it's not a duplicate and then checking that it is in our capture range. Some days I identify a few million curves, of which maybe only a few hundred will be archived. The factorisations are just one of the steps in the processing, following which I need to work out the differences in divisors (as in mathoverflow.net/questions/107719). Currently I'm using Pari/GP for this work, but I now want to try to find an efficient C/assembler solution, or maybe it's even time to invest in a Tesla K20.
– Kevin Acres Nov 21 '12 at 7:16


As Pari/GP has been developed by Henri Cohen, I would expect it to be reasonably optimal for this kind of task...
– Feldmann Denis Nov 21 '12 at 12:36

@Kevin, I am quite fond of C++ with STL and GMP. Evidently it is also possible to call Pari routines. It all fits together: Pari uses GMP for multiple-precision integers and so on. Admittedly, part of the trouble is that I have never figured out how to program in Pari itself. It is very nice, for example, to have the STL "set" template type, which can fit up to about 20,000,000 user-defined "things" (on my little computer) before drastically slowing down. A "thing" could be an ID code for the curve, with the large integer, a partial factorization, and a flag saying completely factored/not yet.
– Will Jagy Nov 21 '12 at 19:25


@Will: I'm trying, if possible, to keep everything in 64 bits and so avoid the need for a multi-precision integer library. Also, my previous experience with STL has been that things went a little slower than I had hoped. In this instance I will regularly be storing and processing in excess of 50,000,000 values per pass of the program.
– Kevin Acres Nov 21 '12 at 20:22

@Kevin, alright, I made a separate answer, but if you already have experience with STL the information is redundant. Plus I do not really think I understand the, well, architecture of your problem. I did notice that your 64-bit idea made GMP unnecessary. One thing I do know is that mathematics programming takes time to get right.
– Will Jagy Nov 21 '12 at 20:35

After your edit: another ad for the C++ STL. The "set" template type is implemented as a red/black tree which sorts at each insertion step. It also uses dynamic memory allocation. Meanwhile, the user creates a "class"; each instance becomes a node in the tree, and the user defines the two relations == for equality and < for strictly less than.

As I said in a comment, each node could have some ID number for the curve, the number you are factoring, perhaps a complete factorization (it appears you wish to continue with a curve only if you have a complete factorization), and the set of divisor differences, perhaps implemented as another "set" if you like.

This is largely opposite to difficult factorizations: the number of divisors $d(n)$ of some positive integer $n$ satisfies
$$ d(n) \leq n^{\left( \frac{\log 2}{\log \log n} \right) \left( 1.5379396\ldots \right)} $$
or roughly $ n^{0.28596} $ up to $2^{60}.$ So a superior highly composite number of that size could have about 146214 divisors. Oh, dear. Well, those are rare and instantly factored.

@Will: I've had previous disasters (speed-wise) with red/black trees, so now I code my own algorithms for this type of work. This stage of the processing is far removed from the final curves: I'm just factoring, storing divisor pairs, sorting the array, finding and storing multiples, then proceeding to the next number.
– Kevin Acres Nov 21 '12 at 20:47

@Kevin, I see. I'm a big fan of having other people do the work for me.
– Will Jagy Nov 21 '12 at 20:50

A good source of highly efficient algorithms and implementations for this kind of problem is Dan Bernstein's homepage. There I found an algorithm that might be useful for weeding out all the small prime factors:

If you have $y/(\log y)^{O(1)}\;$ integers, each with at most $(\log y)^{O(1)}\;$ bits, then you can find all the small prime factors of each integer in time $(\log y)^{O(1)}\;$ per integer.

[I didn't look at the details, so you will have to give it a try to see if 60-bit numbers are already big enough.]

Neal Koblitz's book "A Course in Number Theory and Cryptography" has a good survey of factoring algorithms, including several that work well in this size range. Elliptic curve methods are covered (they were not in Knuth vol. 2 last time I looked) and might be reasonable candidates.

This is really a comment, but what the heck. I would strongly advise against implementing your own factoring. Systems such as Pari/GP are very carefully optimized, and it would take you months, if not years, to get as efficient. 8 GB of RAM is nothing these days (the laptop I am typing this on has 16 GB), so you are better off spending $300 to double your RAM than spending an unbounded amount of your time on a wild goose chase.

@Igor: This is a 32-core machine that I'm coding for, equipped with 32 GB of RAM. Using C I can store 50,000,000 64-bit values in 400 MB of RAM, so each one of the 32 processes will still have plenty of RAM left over. Running Pari/GP means that I can only reliably run 3 processes, with each one using over 8 GB.
– Kevin Acres Nov 22 '12 at 4:28

I have found that Pari/GP is always an excellent tool for prototyping and good for research up to a point. Another instance where it 'maxed out' was in this project: listserv.nodak.edu/cgi-bin/… . Here I hand-coded the AGM functions and achieved much better speeds than Pari/GP was able to offer at the time.
– Kevin Acres Nov 22 '12 at 4:49

Remember that Pari/GP also has a library mode, so you can just make calls to it to factor, and do your own memory management.
– Igor Rivin Nov 22 '12 at 17:59