pdenisowski has asked for the
wisdom of the Perl Monks concerning the following question:

Greetings all,

I have a very large UTF-8 Vietnamese text file that I would like to sort in alphabetical order. The problem is that almost every word processor, utility, etc. out there does not use what I would consider to be the "normal" Vietnamese alphabetical order, usually because it ignores the tone marks (dấu) or puts them in the wrong/random order.

For example, for the first 3 letters of the Vietnamese alphabet I would like to use this sort order:

aáàảãạăắằẳẵặâấầẩẫậ

I've looked at all the different modules, etc. but none of them seem to do this "correctly" (the way most printed dictionaries do). I've also looked at dozens of web pages and can't make any of those examples work properly either.

Any ideas? I've struggled with this for years and would be eternally grateful to anyone who could figure this out.

Thanks,

Paul

(Here is the complete list of letters in the order in which I wish to order them)

Thanks, but there are two issues: (1) that's still not the correct sort order (á should come before à), and (2) I actually get a different "sorted" list when I run the exact same code. This is the problem that I have: the sort algorithms seem to ignore the tone marks.

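Given that the file is large, you could try Sort::External with the following sortsub: sub { $index{$Sort::External::a} <=> $index{$Sort::External::b} }
where %index is a hash with the letters of the Vietnamese alphabet as its keys and their positions in the alphabet as its values. Note that, as written, this looks up each whole record in %index, so it only orders single-letter records; for whole words the comparison has to use keys built letter by letter from %index.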
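To make the %index idea concrete for whole words, here is a minimal sketch in plain Perl. It assumes the per-letter order quoted in the question and only covers those 18 letters; the rest of the alphabet would have to be added to $order. The names %rank, sort_key and sort_words are made up for this illustration, and unlisted characters are simply ranked after the listed ones by code point, which is an arbitrary choice.

```perl
use strict;
use warnings;
use utf8;                              # this source file contains UTF-8 literals
use Unicode::Normalize qw(NFC);

# Only the letters quoted in the question; extend with the full alphabet.
my $order = 'aáàảãạăắằẳẵặâấầẩẫậ';

# Map each letter to its position in the desired order.
my %rank;
my $pos = 0;
$rank{$_} = $pos++ for split //, NFC($order);

# Turn a word into a byte string that sorts correctly with plain "cmp":
# every character becomes a 4-byte big-endian rank.
sub sort_key {
    my ($word) = @_;
    return join '', map {
        pack 'N', exists $rank{$_} ? $rank{$_} : 1000 + ord $_
    } split //, NFC($word);
}

# Decorate, sort on the key, undecorate.
sub sort_words {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ sort_key($_), $_ ] } @_;
}

binmode STDOUT, ':encoding(UTF-8)';
print join(' ', sort_words('à', 'a', 'á', 'ă', 'â', 'ạ')), "\n";
# prints: a á à ạ ă â
```

The same sort_key could be called inside a Sort::External sortsub on $Sort::External::a and $Sort::External::b, at the cost of recomputing keys on every comparison. Note that this compares strictly letter by letter, so "an" sorts before "á"; whether that matches a given printed dictionary is something to check.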

It's still missing a correct 'secondary sort' (for the edge case when the diacritic-stripped words are identical); it should not be difficult to add once someone figures out a suitable transliteration that sorts asciibetically.
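One possible shape for that secondary sort, as a rough sketch rather than a finished solution: strip only the tone marks for the primary key (so ă, â, ê, ô, ơ, ư and đ stay distinct letters), then break ties on the tone of each letter. It assumes the tone order from the question (none, acute, grave, hook above, tilde, dot below) and the usual 29-letter alphabet; two_level_key and sort_two_level are hypothetical names. Unicode::Normalize ships with Perl.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

# Combining tone marks, ranked as in the question: a á à ả ã ạ.
my %tone_rank = (
    "\x{0301}" => 1,   # acute
    "\x{0300}" => 2,   # grave
    "\x{0309}" => 3,   # hook above
    "\x{0303}" => 4,   # tilde
    "\x{0323}" => 5,   # dot below
);
my $tone_re = qr/[\x{0300}\x{0301}\x{0303}\x{0309}\x{0323}]/;

# Base letters; ranks start at 1.
my %letter_rank;
my $n = 1;
$letter_rank{$_} = $n++ for split //, NFC('aăâbcdđeêghiklmnoôơpqrstuưvxy');

sub two_level_key {
    my ($word) = @_;
    # Primary: decompose, drop tone marks, recompose the base letters.
    (my $base = NFD($word)) =~ s/$tone_re//g;
    my $primary = join '', map {
        pack 'N', exists $letter_rank{$_} ? $letter_rank{$_} : 1000 + ord $_
    } split //, NFC($base);
    # Secondary: one tone digit per letter (0 = no tone mark).
    my $secondary = join '', map {
        my ($mark) = NFD($_) =~ /($tone_re)/;
        defined($mark) ? $tone_rank{$mark} : 0;
    } split //, NFC($word);
    return [ $primary, $secondary ];
}

sub sort_two_level {
    return map  { $_->[1] }
           sort { $a->[0][0] cmp $b->[0][0] or $a->[0][1] cmp $b->[0][1] }
           map  { [ two_level_key($_), $_ ] } @_;
}
```

With this scheme sort_two_level('mà', 'ma', 'an', 'má', 'á', 'a', 'ăn') should come back as a, á, an, ăn, ma, má, mà: words that differ only in tone stay next to each other, ordered by tone.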

It's still missing a correct 'secondary sort' (for the edge case when the diacritic-stripped words are identical);

(laughs) That's hardly an "edge case" in Vietnamese - there are thousands of minimal pairs where the only difference between the words is the diacritical marks. While it's possible to read and understand Vietnamese typed in (7-bit) ASCII without too much ambiguity (i.e. you can figure out what word is meant from the context), this obviously wouldn't work for a dictionary.

The other issue is that the words in the dictionary need to be sorted in the "correct" order for me to detect duplicates, etc.
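For the duplicate check, a small sketch of the idea, assuming the list has already been sorted so that identical words end up next to each other. Both sides are normalized to NFC before comparing, so a precomposed "á" and a decomposed "a" plus combining acute count as the same word. adjacent_duplicates is a made-up name.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Return every entry that equals the entry just before it.
sub adjacent_duplicates {
    my @sorted = @_;
    my @dups;
    for my $i (1 .. $#sorted) {
        push @dups, $sorted[$i]
            if NFC($sorted[$i]) eq NFC($sorted[$i - 1]);
    }
    return @dups;
}
```

This only reports exact repeats; entries that differ in tone marks are treated as different words, which is what a dictionary needs.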