I'm trying to sort the lines of a text file by their Unicode code point values. As far as I can tell, this means numerals first, then letters, then CJK ideographs. However, sort (with LC_ALL=C) fails horribly at this task. Here is an excerpt from my list:

It seems like sort ignores (at least sometimes) the characters it can't read, because Altneuland would indeed be between Alternative and Amateras Records. Someone suggested using msort, but it failed as well (with options -u c, -u d, and -u n, respectively).

LC_ALL=C is for ASCII/7-bit; it's pretty much guaranteed to do the wrong thing for multi-byte characters. Which Unicode encoding is the file in (UTF-8, UTF-16, UTF-32, legacy UCS-x)? GNU sort with a correctly set locale is almost certainly up to the task.
– mr.spuratic, Mar 13 '13 at 22:31

@mr.spuratic, at least all of those encodings are meant to sort the same way when compared by byte value, and sorting by byte value is what the C locale is meant to do. Locales other than C and POSIX don't sort by byte value but follow language-specific rules.
– Stéphane Chazelas, Mar 13 '13 at 22:43
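
For UTF-8 input specifically, the byte-value point above can be checked with a short sketch. The sample lines are illustrative only (three titles come from the question, the rest are made up, and this is not the asker's actual list), and the helper names are hypothetical. Comparing raw bytes is an approximation of what sort does under LC_ALL=C.

```python
# Illustrative sample lines; not the asker's actual list.
lines = ["Amateras Records", "東方", "Alternative", "2hu", "Altneuland", "Ärger"]


def sort_by_code_point(items):
    # Python compares str values code point by code point.
    return sorted(items)


def sort_by_utf8_bytes(items):
    # Approximates a byte-value sort of UTF-8 encoded lines.
    return sorted(items, key=lambda s: s.encode("utf-8"))


by_code_point = sort_by_code_point(lines)
by_utf8_bytes = sort_by_utf8_bytes(lines)

print(by_code_point)
# UTF-8 is designed so that byte order matches code point order.
print(by_code_point == by_utf8_bytes)
```

If a byte-value sort of a UTF-8 file gives a different order than this, the file is probably in another encoding, or the locale setting is not reaching sort.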

@StephaneChazelas yes, of course, but I mean specifically the input file encoding: how can UTF-16 or UTF-32 sort "correctly" if the endianness is not specified?
– mr.spuratic, Mar 13 '13 at 23:00

Oh yes, you're right, I forgot about the endianness issue, not to mention the fact that UTF-16/UCS-2 newline characters take two bytes, so the content would be mangled by LC_ALL=C sort.
– Stéphane Chazelas, Mar 13 '13 at 23:03
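
The two UTF-16 problems raised in the last comments can be shown with a small sketch. It assumes the input is UTF-16 with a byte order mark; sort_utf16_text is a hypothetical helper name, and decoding before sorting is just one possible workaround, not something proposed in the thread.

```python
# A newline in UTF-16-LE is the two bytes 0x0A 0x00, so splitting on the
# single byte 0x0A leaves stray NUL bytes and misaligned chunks.
raw_le = "b\na\n".encode("utf-16-le")
chunks = raw_le.split(b"\n")
print(chunks)

# Byte order also changes the result: in little-endian form U+0100 starts
# with byte 0x00, so a byte-value sort places it before "A" (U+0041).
pair = ["A", "\u0100"]
by_le_bytes = sorted(pair, key=lambda s: s.encode("utf-16-le"))
by_code_point = sorted(pair)
print(by_le_bytes == by_code_point)


def sort_utf16_text(raw):
    # The "utf-16" codec uses the BOM to determine the byte order.
    text = raw.decode("utf-16")
    # splitlines() also breaks on other Unicode line separators,
    # not only on "\n".
    return sorted(text.splitlines())


# Encoding with "utf-16" prepends a BOM in the platform's byte order.
sample = "b\na\n".encode("utf-16")
print(sort_utf16_text(sample))
```

Decoding first sidesteps both issues, because the comparison then happens on code points rather than on encoded bytes.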