I have a PC with an Intel(R) Pentium(R) G640 CPU @ 2.80GHz and 8 GB of RAM. I installed Scientific Linux 6.5 on it with an EXT3 filesystem.

Q: What is the fastest way to run "sort -u list.txt -o list-sorted.txt" on a file that is 200 GB in size?

Maybe split it into files smaller than 8 GB, "sort -u" each of them, concatenate the results, split them again with different sizes, "sort -u" them again, and so on? Or are there any sorting scripts or programs that can handle files this big, with this limited RAM?

Please edit your question and explain what happens when you try the command you posted. Do you run out of disk space? The command should work as long as you have enough free space on your /tmp.
– terdon♦ Mar 17 '14 at 18:53

sort does not do all of its work in RAM; it stores temporary files in /tmp, so RAM should not be the limiting factor. Please edit and answer my previous questions, and also post the output of df -h.
– terdon♦ Mar 17 '14 at 18:56

The chosen answer basically says what @terdon is saying, but also check out this one: stackoverflow.com/a/13025731/2801913. You will need GNU parallel for this, I think, rather than the moreutils parallel that is installed by default on some systems.
– Graeme Mar 17 '14 at 19:35
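The approach in that linked answer is, roughly, split/sort/merge. A sketch of the idea, assuming GNU coreutils and GNU parallel (the 1G chunk size and the chunk. prefix are illustrative):

# split on line boundaries into ~1 GB pieces named chunk.aa, chunk.ab, ...
split -C 1G list.txt chunk.
# sort and deduplicate each chunk, two at a time
parallel -j2 sort -u {} -o {} ::: chunk.*
# merge the already-sorted chunks, dropping duplicates across them
sort -mu -o list-sorted.txt chunk.*
rm -f chunk.*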

GNU sort has a --parallel=n option. From the coreutils documentation ('sort invocation'): Set the number of sorts run in parallel to n. By default, n is set to the number of available processors, but limited to 8, as there are diminishing performance gains after that. Note also that using n threads increases the memory usage by a factor of log n. Also see 'nproc invocation'.

Since your CPU has 2 cores, you could do:

sort --parallel=2 -uo list-sorted.txt list.txt

It is better to specify the actual number of physical cores, since hyper-threading can make the processor appear to have more.
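If in doubt, you can check the physical core count with lscpu (part of util-linux and available on most distributions, including Scientific Linux 6):

lscpu | grep -E '^(CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket)'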

You could also experiment with nice to influence CPU scheduling priority and ionice to influence I/O scheduling. You can raise the priority over other processes this way, though I don't think it will give you large savings, as these tools are usually better suited to making sure a background process doesn't use too many resources. Nevertheless, you can combine them with something like the following:
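(The exact values here are illustrative: negative niceness requires root, and ionice class 2 is the best-effort class, with 0 as its highest priority.)

nice -n -5 ionice -c2 -n0 sort --parallel=2 -uo list-sorted.txt list.txt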

Note also that, as Gilles commented, a single GNU sort invocation will be faster than any scheme that breaks the sorting down into pieces, since the algorithm is already optimised to handle files larger than memory. Anything else will likely just slow things down.

sort -u doesn't report unique lines, but one of each set of lines that sort the same. In the C locale, two different lines never sort the same, but that's not the case in most UTF-8 based locales.
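For instance, glibc assigns no collation weight to many characters, so visibly different lines can compare equal. Whether the following prints one line or two depends on the glibc version (older releases, such as the one shipped with Scientific Linux 6, tend to treat both as equal):

printf '%s\n' '①' '②' | LC_ALL=en_GB.utf8 sort -u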

Also, using the C locale avoids the overhead of having to parse UTF-8 and process complex sort orders, so it improves performance dramatically.

So:

LC_ALL=C sort -u file
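Applied to the question's file, and combined with the --parallel suggestion from above:

LC_ALL=C sort --parallel=2 -uo list-sorted.txt list.txt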

You can also improve performance by using a faster drive (or a different drive from the one holding the input and/or output files) for the temporary files (using the -T option or the $TMPDIR environment variable), or by fiddling with the -S option supported by some sort implementations.
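For example (the /mnt/fastdisk/tmp path is hypothetical; -S sets how much memory GNU sort may use before spilling to temporary files):

LC_ALL=C sort -T /mnt/fastdisk/tmp -S 4G -uo list-sorted.txt list.txt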

For some types of input, or for slow storage, using the --compress-program option of GNU sort (for instance with lzop) might improve performance as well as reduce temporary storage usage.
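For example (assuming lzop is installed; sort runs it to compress and decompress its temporary files):

LC_ALL=C sort --compress-program=lzop -uo list-sorted.txt list.txt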

Now just a note for those objecting (rightly, to some extent) that this will not be the correct order:

I agree that as a human, I'd like to see Stéphane sort in between Stefan and Stephanie, but:

A computer would want Stéphane to sort after the other two, since é (at least when expressed as U+00E9), whether compared as a character or as the bytes of its UTF-8 encoding, sorts after every ASCII letter in terms of codepoint or byte value. That's a sort order that is very simple to implement, is a strict total order, and holds no surprises.
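To illustrate, in the C locale Stéphane comes last, because the first byte of é's UTF-8 encoding (0xC3) is greater than any ASCII letter:

printf '%s\n' 'Stéphane' 'Stefan' 'Stephanie' | LC_ALL=C sort

This prints Stefan, then Stephanie, then Stéphane.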

Your locale's sort order will likely not be satisfactory in many cases either, even to a human. For example, on my system with the default en_GB.utf8 locale:

Stéphane and Stéphane (one with é as U+00E9, the other as e followed by the combining acute accent U+0301) don't sort the same:
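A quick way to check, assuming a shell such as bash 4.2 or later that supports $'\u...' escapes:

printf '%s\n' $'St\u00e9phane' $'Ste\u0301phane' | sort -u

Both lines survive the -u, confirming the two spellings don't collate equally, even though they display identically.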