Abstract

This research details the development of a novel methodology, called the frequent max substring technique, for extracting indexing terms and constructing an index for Thai text documents.

With the rapidly increasing number of Thai digital documents available in digital media and on websites, it is important to find an efficient Thai text indexing technique to facilitate search and retrieval. An efficient index would shorten response time and improve the accessibility of the documents. To date, little research has been conducted on Thai text indexing compared with more commonly used languages such as English. The more commonly used Thai text indexing technique is the word inverted index, which is language-dependent (i.e. it requires linguistic knowledge). This technique builds word-to-document indices over a document collection to enable efficient keyword-based search. However, to use the word inverted index technique, Thai text documents must first be parsed and tokenized into individual words. One of the main issues is therefore how to automatically identify the indexing terms in Thai text documents before constructing the index. This is difficult because the syntax of the Thai language is highly ambiguous and Thai is a non-segmented language (i.e. a text document is written continuously as a sequence of characters without explicit word boundary delimiters). To index Thai text documents, most language-dependent indexing techniques rely on the performance of a word segmentation approach to extract the indexing terms before constructing the index. However, word segmentation is time-consuming, and segmentation accuracy depends heavily either on the linguistic knowledge used in the underlying segmentation algorithms or on the dictionary or corpus used in the segmentation. For this reason, most language-dependent indexing techniques are time-consuming and require additional storage space for a dictionary, a corpus, or manually crafted rules.

Apart from the language-dependent indexing techniques, some language-independent techniques, such as the n-gram inverted index and suffix array approaches, have been proposed as alternative indexing techniques for the Thai language. These approaches are simple and fast because they are language-independent: they do not require linguistic knowledge of the language or the use of a dictionary or a corpus. However, these techniques are limited in that they require more storage space for extracting the indexing terms and constructing the index.
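As an illustration of the language-independent idea, the following is a minimal sketch of a character n-gram inverted index. It is not the implementation evaluated in this thesis; the function names `build_ngram_index` and `search`, the choice of n = 3, and the sample documents are assumptions made for this example only.

```python
from collections import defaultdict

def build_ngram_index(docs, n=3):
    """Map each character n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Slide a window of n characters over the unsegmented text.
        for i in range(len(text) - n + 1):
            index[text[i:i + n]].add(doc_id)
    return index

def search(index, docs, query, n=3):
    """Return the sorted ids of documents that contain the query string."""
    if len(query) < n:
        # Query too short to form an n-gram: fall back to a direct scan.
        return sorted(d for d, t in docs.items() if query in t)
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    # Candidates must contain every n-gram of the query.
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Sharing all n-grams is necessary but not sufficient, so verify.
    return sorted(d for d in candidates if query in docs[d])

# Hypothetical sample documents (unsegmented Thai text).
docs = {1: "การค้นหาข้อมูล", 2: "ข้อมูลภาษาไทย", 3: "ภาษาอังกฤษ"}
index = build_ngram_index(docs)
print(search(index, docs, "ข้อมูล"))  # [1, 2]
print(len(index), "distinct 3-grams indexed")
```

No word segmentation or dictionary is needed, but every overlapping n-gram becomes an indexing term, which is the source of the storage cost noted above.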

To address the above limitations, this thesis develops a frequent max substring technique that uses a language-independent text representation, is computationally efficient, and requires little storage space. The frequent max substring technique reduces index construction time relative to the language-dependent techniques (i.e. the word inverted index) because it does not require text pre-processing tasks (i.e. word segmentation) to extract the indexing terms before indexing can be performed. It also improves space efficiency compared with some existing language-independent techniques. This is achieved by retaining only the frequent max substrings, which are strings that are both long and frequently occurring, thereby reducing the number of insignificant indexing terms in the index.
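The idea of keeping only strings that are both long and frequent can be sketched as follows. This is a brute-force illustration under an assumed definition, not the algorithm developed in this thesis: here a substring is taken to be "frequent" if it occurs at least `min_freq` times (overlapping occurrences counted) and "max" if no one-character extension to its left or right is also frequent. The function name and the threshold parameter are hypothetical.

```python
from collections import Counter

def frequent_max_substrings(text, min_freq=2):
    """Return {substring: count} for frequent substrings that cannot be extended.

    Assumed definition (for illustration only): a substring is kept if it
    occurs at least min_freq times and no one-character extension of it,
    on either side, also occurs at least min_freq times.
    """
    n = len(text)
    # Count every substring; quadratic in len(text), fine for a small demo.
    counts = Counter(text[i:j] for i in range(n) for j in range(i + 1, n + 1))
    frequent = {s for s, c in counts.items() if c >= min_freq}
    alphabet = set(text)
    result = {}
    for s in frequent:
        extendable = any(
            (ch + s) in frequent or (s + ch) in frequent for ch in alphabet
        )
        if not extendable:
            result[s] = counts[s]
    return result

terms = frequent_max_substrings("abcabcabd", min_freq=2)
print(terms)  # {'abcab': 2}
```

In this toy input, twelve substrings meet the frequency threshold, but only one survives the maximality check, which shows how the filter shrinks the set of indexing terms.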

To demonstrate that the frequent max substring technique achieves this performance, this thesis presents experimental studies and comparison results on indexing Thai text documents. The technique was evaluated and compared in terms of indexing efficiency and retrieval performance. The results show that the frequent max substring technique is more computationally efficient than the word inverted index and requires less index space than some language-independent techniques.

Additionally, this thesis shows that the frequent max substring technique is versatile: it can be combined with other Thai language-dependent techniques to form a novel hybrid language-dependent technique that further improves indexing quality. The technique can also be used with a neural network to enhance non-segmented document clustering. Because it is language-independent, the frequent max substring technique can also be applied to other non-segmented texts, such as Chinese text and genome sequences in bioinformatics.