Extension Proposal

Input Sequence Correction

OpenOffice.org type-and-replace, as defined by James Clark et al. in a specification document (somewhere?) and implemented in the OpenOffice.org ExtendedInputSequenceChecker.

A few standard keyboard shortcuts?

Especially to control word/line breaking and/or ligatures/forms.

Non-breaking space - ctrl+space (from OOo)

No-width optional-break - ctrl+/ (from OOo)

No-width no break (from OOo)

I use "no-width no break" in OOo very often to control line breaks for words not in the dictionary, but currently there is no shortcut for it. If we implement this functionality in other systems like GNOME, we should use the same key across all systems.
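As a rough illustration of the three marks listed above, here is a minimal Python sketch. The code point mapping is an assumption, not something stated in this proposal: U+00A0 for non-breaking space, U+200B for no-width optional break, and U+2060 (WORD JOINER) for no-width no break. The helper name `protect` is hypothetical. It glues the characters of a word together, which is roughly what one does by hand for a word that is not in the dictionary.

```python
import unicodedata

NBSP = "\u00a0"  # NO-BREAK SPACE
ZWSP = "\u200b"  # ZERO WIDTH SPACE (assumed here to be "no-width optional break")
WJ = "\u2060"    # WORD JOINER (assumed here to be "no-width no break")


def protect(word: str) -> str:
    """Insert WORD JOINER between characters of `word`.

    No joiner is placed before a combining mark (category Mn), so a base
    character stays attached to its upper/lower vowels and tone marks.
    """
    out = []
    for i, ch in enumerate(word):
        if i > 0 and unicodedata.category(ch) != "Mn":
            out.append(WJ)
        out.append(ch)
    return "".join(out)
```

Whether a given line breaker honours the joiner inside a Thai word depends on the implementation, which is one reason the interaction should be specified.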

iso-codes as a resource for country names, language names, monetary units, etc.

Characters for Listing / Enumeration

Define a (recommended) set of characters that will be used for listing ?

ก. …
ข. …
ค. …

For example, should characters like ซ ฏ ฎ ฃ ฅ ฆ … be allowed ?
Those characters share one visual property: a *zigzag* at the head, *which is the only visual feature that distinguishes them from their twins* (ช ข ค ม …). In small print this is difficult to notice, so the twins are hard to tell apart.

And how about the obsolete "ฦ", which should always be excluded? All of the above are included in the Thai-letter "Bullet and Numbering" sequence in OpenOffice.org because no reference could be found at the time.

This is used in word processors and office suites.

In thailatex and gnome-doc-utils, the skipped characters are ฃ ฅ ฆ ฤ ฦ.

In the preamble page numbers of the Royal Institute dictionary, 2525 B.E. edition, the skipped characters are ฃ ฅ but not ฆ. (The last page is ด.) However, the 2542 B.E. edition skips none. (The last page is ฒ.)

Recommended Criteria ?

should only be consonants ? - so ฤ and ฦ (vowels) are excluded

should easily be distinguished by shape ? - zigzag twins are excluded, ข but not ฃ, ค but not ฅ

By this criterion, ต is also skipped in favor of ด?

Probably, adapting this criterion to only skip the obsolete ฃ and ฅ should be enough.

should have a different read-out ? - ค but not ฆ, ถ but not ฐ, ท but not ฒ, ต but not ฏ, ด but not ฎ, ย but not ญ, ช but not ฌ

This may only apply to short lists, like answer choices in school tests. For long lists, like page numbers, the set will be quickly exhausted.

cons: with these limitations, we will have a smaller set of characters to use and will soon need double characters like กก กข กค ….

Skipping some consonants may be important only for the first few items. Thus ฆ could arguably be skipped, to be consistent with school test answer choices. For later items in the sequence, people may not care about the difference between individual items. Rather, completeness for countability may become more important as the list grows.
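To make the trade-off concrete, here is a minimal Python sketch of a label generator with a configurable skip set. The function names are hypothetical. The default skip set is the one reported above for thailatex and gnome-doc-utils, and the rollover to double characters is modelled as plain bijective numbering, which is an assumption rather than a documented behaviour of either package.

```python
# U+0E01..U+0E2E: the 44 Thai consonants plus the vowel letters ฤ and ฦ.
ALL_LETTERS = [chr(cp) for cp in range(0x0E01, 0x0E2F)]

# Skip set reported above for thailatex and gnome-doc-utils.
THAILATEX_SKIP = frozenset("ฃฅฆฤฦ")


def make_alphabet(skip=THAILATEX_SKIP):
    """Return the listing alphabet in code point order, minus `skip`."""
    return [ch for ch in ALL_LETTERS if ch not in skip]


def thai_label(n, alphabet):
    """Return the label for item n (1-based), rolling over to กก, กข, ...

    This is bijective base-len(alphabet) numbering, the same scheme
    spreadsheets use for column names (A..Z, AA, AB, ...).
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    digits = []
    while n > 0:
        n, r = divmod(n - 1, len(alphabet))
        digits.append(alphabet[r])
    return "".join(reversed(digits))
```

With the default skip set the alphabet has 41 letters, so item 42 is the first to need a double character. Passing a different `skip` (for example only ฃ and ฅ) shows how each criterion changes the point where single letters run out.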

Line segmentation (Line wrapping)

Line break opportunities are the logically possible break positions; actual line breaks are usually determined by display width. Line breaks are used for word-wrapping text.

AFAIK, one pattern is problematic in the UAX #14 definition. I'm not sure whether UAX #14 breaks "นาง(สาว)" before "(", but for "ดี ๆ" it will break before "ๆ". This should be specified precisely in our standard, including the interaction with no-width optional break and no-width no break.

note: this is display-oriented, and not intended for general text processing

A line-break iterator (segmenter) may propose more than one break position; it is the job of the page layout engine to decide which position is best.
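The division of labour between the break iterator and the layout engine can be sketched as follows. This is a minimal greedy chooser in Python, assuming the break opportunities are already given as string offsets; the function name `choose_breaks` and the `measure` parameter are hypothetical, and a real layout engine would measure rendered width rather than count characters.

```python
def choose_breaks(text, opportunities, width, measure=len):
    """Greedily split `text` into lines using only the proposed offsets.

    `opportunities` are offsets where a break is allowed (as proposed by a
    line-break iterator). `measure` returns the display width of a string;
    len() is only a stand-in for real text measurement.
    """
    positions = sorted(set(opportunities) | {len(text)})
    lines = []
    start = 0
    while start < len(text):
        best = None
        for pos in positions:
            if pos <= start:
                continue
            if measure(text[start:pos]) <= width:
                best = pos  # keep the furthest opportunity that still fits
            else:
                break
        if best is None:
            # Nothing fits: take the nearest opportunity and let it overflow.
            best = next(p for p in positions if p > start)
        lines.append(text[start:best])
        start = best
    return lines
```

For example, `choose_breaks("aaa bbb cc", [4, 8], 5)` yields three lines, while a width of 8 lets the first two segments share a line. The iterator only proposes; the chooser decides.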

Hyphenation

In line segmentation, especially for multi-column pages, it is sometimes hard to find an optimal line break that falls exactly on a word break. In this case, the line break may fall in the middle of a word. This is where hyphenation comes in.

Sentence/Clause segmentation

note: the reason we don't simply call this "sentence segmentation" is that some linguists are still skeptical whether Thai has a construct like a "sentence". See "Thoughts on Word and Sentence Segmentation in Thai" (http://pioneer.chula.ac.th/~awirote/ling/snlp2007-wirote.pdf), which suggests that "clause" may be a more plausible unit.

Question: is it possible to do this with a rule-based approach ? Corpus-based approaches have been explored, e.g. The Automatic Thai Sentence Extraction (2000) and Sentence Break Disambiguation for Thai, but would that be too heavy and impractical ? There will of course be a problem pushing a corpus-based implementation into any international OSS project. (Mozilla once rejected dictionary-based word segmentation from Samphan.)

One answer: it can possibly be carried out by rules at a clearer level of text tokens, such as the phrase. The corpus-based approach is in any case more applicable because of its deterministic nature. However, at present it is mostly domain-dependent due to the limited Thai language resources available for training. It is still an important research topic, not only in engineering but also in linguistics, at several research institutes. It is hence difficult to put into a standard.

Apart from specifications, should WTT provide recommendations or suggestions on things that can't be put into a standard (like this) ? That would make it easier for developers to decide which direction to take.

e.g. even if we can't say exactly what it should be, we can recommend what it shouldn't be.

Searching/Matching/Comparison

Are there any issues with Thai searching ? Does the find dialog box currently work well for you ?

One requirement for Thai (or any CTL) searching (from the Unicode specification) is that the matched text must begin and end on cluster boundaries. That is, searching for "ที" will not match "ที่". This is the normal behavior in CTL-enabled word processors like OpenOffice.org or Microsoft Office.
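The cluster-boundary requirement can be sketched in a few lines of Python. This is only an approximation under a stated assumption: a position counts as a cluster boundary when the character there is not a nonspacing mark (Unicode general category Mn), which covers the Thai upper/lower vowels and tone marks. The function names are hypothetical.

```python
import unicodedata


def is_mark(ch):
    """True for nonspacing marks such as Thai upper/lower vowels and tones."""
    return unicodedata.category(ch) == "Mn"


def find_on_cluster_boundaries(text, pattern):
    """Return offsets where `pattern` occurs starting and ending on a boundary."""
    hits = []
    if not pattern:
        return hits
    i = text.find(pattern)
    while i != -1:
        end = i + len(pattern)
        starts_ok = not is_mark(text[i])
        # A match is cut short if the next character still belongs to the cluster.
        ends_ok = end == len(text) or not is_mark(text[end])
        if starts_ok and ends_ok:
            hits.append(i)
        i = text.find(pattern, i + 1)
    return hits
```

Searching "ที่" for "ที" returns no hits because the tone mark that follows belongs to the same cluster, while searching "ที" for "ที" returns offset 0.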

A further issue is whether searching for "กา" should match "เกา".

Soundex

related to Phonetic mark-up?

Phonetic mark-up supports soundex search well, but soundex does not necessarily require phonetic mark-up if we have automatic phonetic generation. Soundex can be standardized given several methods:

Thai Transliteration/Romanization algorithm

related to Phonetic mark-up?

Should this be in WTT 3.0 scope ?
A Thai romanization method was codified by the Thai Royal Institute many years ago. It has been used in, for example, transliteration engines at NECTEC and CU. Romanization by RI is based mainly on phonetic transcription with a phonetic-to-roman character mapping table.

Instead of saying character boundary, should we reserve the word "character" for the concept that can be represented by a code point, i.e. a single TIS-620 character, and then use "cluster" or (in Unicode terms) "grapheme cluster" for a combining character sequence like กี่, กู้ or กำ ?

One pattern that is always an issue is whether "กำ" or "น้ำ" is one or two clusters. OpenOffice.org and Microsoft Office treat the pattern as one unit.
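The two readings of the SARA AM pattern can be compared with a minimal Python sketch. It assumes a simplified cluster rule (a base character plus any following nonspacing marks, Unicode category Mn) and adds a switch for whether SARA AM (U+0E33) joins the preceding cluster. The function name and the `sara_am_joins` parameter are hypothetical, and this is not a full grapheme cluster implementation.

```python
import unicodedata

SARA_AM = "\u0e33"  # ำ, general category Lo, so it is not caught by the Mn test


def clusters(text, sara_am_joins=True):
    """Split `text` into clusters: a base plus its following combining marks.

    With sara_am_joins=True, SARA AM is attached to the preceding cluster
    (the one-unit behaviour described above); with False it stands alone.
    """
    out = []
    for ch in text:
        attaches = unicodedata.category(ch) == "Mn" or (
            sara_am_joins and ch == SARA_AM
        )
        if out and attaches:
            out[-1] += ch
        else:
            out.append(ch)
    return out
```

With the switch on, "น้ำ" is a single cluster; with it off, the result is "น้" followed by a separate "ำ". Whichever reading the standard picks, cursor movement, deletion and searching should follow the same one.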