Nine days ago Yacy automatically paused a crawler for lack of disk space, and consequently started post-processing. Since then it has not yet finished, running 24/7.

It perpetually flips from a short phase of CPU burst to a much longer phase of hard disk burst: from all cores at 100% to a full queue of small hard disk read ops.The index size is moderate with 24M records over 250GB. OS files cache is 12GB. JVM heap 4GB.May I do something to speed it up?

Side consideration:Standing solely from the observed behavior, it appears that the post-processing algorithms perform something like a one-to-every comparison between indexed records, in a way similar to:

for record_a in index; do for record_b in index; do one-to-every comparison donedone

If this is actually the case, then the post-process reads over and over the whole index file; since only a fraction of the index fits in the OS cache, this causes a massive amount of small, random, inefficient read ops. Given this hypothesis true, would it be feasible to improve the algorithm by either:

Keep it read the whole index file over and over, as it currently does, but performing sequential reads rather than random ones. Rationale: reading the whole index file sequentially from head to tail is faster than reading it randomly with "partial field reads";

Perform the comparison in large batches rather than with individual records, so avoid the one-to-every check and instead perform a many-to-every check. Consideration: a large batch might consist of anything between 1 MB to 10 GB of indexed records, enough to entirely fit them within available RAM, so to perform multiple comparisons at a time against each record read from disk.

Postprocessing is deactivated by default at least since one year. This process may be interesting to generate SEO data but not for standard YaCy operations.If your postprocessing is still active, deactivate it by going to /IndexSchema_p.html and then deactivate the field process_sxt

raw duplicates are identified by their url hash and that is default behaviour, no postprocessing is needed.'special duplicates' like with/without http(s) and/or with/without leading '.www' are handled by the postprocessing in such a way that one of these variants is set to be visible. Without the postprocessing any of the variant which appears first is preferred and the 'similar' url gets a worse ranking.