Effects of Information and Machine Learning Algorithms on Word Sense Disambiguation with Small Datasets

Current approaches to word sense disambiguation use (and often combine) various machine learning techniques. Most rely on features of the ambiguous word and its surrounding words, and are trained on thousands of examples. Unfortunately, developing large training sets is burdensome, and in response to this challenge, we investigate the use of symbolic knowledge for small datasets. A naive Bayes classifier was trained for 15 words, with 100 examples for each. Unified Medical Language System (UMLS) semantic types assigned to the concepts found in the sentence, together with relationships between these semantic types, form the knowledge base. The most frequent sense of each word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in nine experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from an 8% deterioration to a 29% improvement. To investigate this large variance, we performed several follow-up evaluations, testing additional algorithms (decision tree and neural network) and gold standards (per expert), but the results did not differ significantly. However, we noted a trend that the best disambiguation was achieved for the words that were least troublesome to the human evaluators. We conclude that neither the algorithm nor individual human behavior causes these large differences, but that the structure of the UMLS Metathesaurus (used to represent the senses of ambiguous words) contributes to inaccuracies in the gold standard, leading to the varied performance of word sense disambiguation techniques.
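
The evaluation setup described above (a naive Bayes classifier over semantic-type features, scored by 10-fold cross-validation against a most-frequent-sense baseline) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the semantic-type identifiers, the two example senses, and the synthetic data generator are all hypothetical stand-ins for the UMLS-derived features and hand-annotated examples used in the study.

```python
import math
import random
from collections import Counter

# Hypothetical UMLS semantic-type IDs used as binary (Bernoulli) features.
SEM_TYPES = ["T047", "T121", "T023", "T061"]

def make_example(sense, rng):
    """Generate a synthetic 'sentence' as a set of semantic types.
    The 'disease' sense co-occurs mostly with T047, 'drug' with T121."""
    types = set()
    major, minor = ("T047", "T121") if sense == "disease" else ("T121", "T047")
    if rng.random() < 0.9:
        types.add(major)
    if rng.random() < 0.3:
        types.add(minor)
    if rng.random() < 0.5:
        types.add(rng.choice(["T023", "T061"]))
    return types, sense

def train_nb(data):
    """Fit sense priors and per-sense semantic-type counts."""
    prior = Counter(sense for _, sense in data)
    cond = {sense: Counter() for sense in prior}
    for types, sense in data:
        for t in types:
            cond[sense][t] += 1
    return prior, cond, len(data)

def predict(model, types):
    """Pick the sense with the highest log posterior (add-one smoothing)."""
    prior, cond, n = model
    best, best_lp = None, float("-inf")
    for sense, count in prior.items():
        lp = math.log(count / n)
        for t in SEM_TYPES:
            p = (cond[sense][t] + 1) / (count + 2)
            lp += math.log(p if t in types else 1.0 - p)
        if lp > best_lp:
            best, best_lp = sense, lp
    return best

def cross_validate(data, k=10):
    """Accuracy averaged over k held-out folds."""
    folds = [data[i::k] for i in range(k)]
    correct, total = 0, 0
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_nb(train)
        for types, sense in folds[i]:
            correct += predict(model, types) == sense
            total += 1
    return correct / total

rng = random.Random(0)
# 100 examples per ambiguous word, as in the study's small-dataset setting.
data = [make_example(rng.choice(["disease", "drug"]), rng) for _ in range(100)]
acc = cross_validate(data)
# Most-frequent-sense baseline: always guess the majority sense.
baseline = max(Counter(sense for _, sense in data).values()) / len(data)
print(f"10-fold CV accuracy: {acc:.2f}, majority baseline: {baseline:.2f}")
```

On this toy data the classifier comfortably beats the majority-sense baseline; the study's point is that with real UMLS features the margin varied widely across the 15 words.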