This module carries out word sense disambiguation (WSD),
which is the process of selecting the correct sense for a word in a given context.
The correct sense is selected from a sense inventory which lists the possible meanings of a word.
This module uses the WordNet lexical database as its sense inventory.

This directory contains the Perl modules that do the actual work of disambiguation. By default, these files are installed into /usr/local/lib/perl5/site_perl/PERL_VERSION (where PERL_VERSION is the version of Perl you are using). See the INSTALL file for more information.

This directory contains a number of scripts that let you run word sense disambiguation experiments and reformat data.

These scripts will be installed when 'make install' is run. By default, these files are installed into your /usr/local/bin directory. See the INSTALL file for more information. The scripts in this directory are:

This script splits text into sentences. The input to allwords is expected to contain one sentence per line, one line per sentence. If your text is not in this format, you can use this script to split the text into sentences. Note that no other part of the package calls this script automatically; if your text is not in the required format, you should run this script explicitly before using wsd.pl or the web interface.

This script extracts content words given an answer file (typically a plain text file extracted using extract-semcor-plaintext.pl and then tagged using a part of speech tagger) and a key file extracted using the --key option of extract-semcor-plaintext.pl.

This directory contains all of the *.pod files used to document the system. These are processed via pod2text, and the output is placed in the top level directory. The resulting top level text files should be considered read only.

This directory contains examples of the different formats of data that are supported by this package. It also contains a sample stoplist. There is a README file in the directory that describes the contents in more detail.

This directory contains the allwords web server and interface. There are detailed README and INSTALL instructions within this directory. Installing the web interface is optional, and is separate from installing the main package.

Words can have multiple meanings or senses. For example, the word glass in WordNet [1] has seven senses as a noun and five senses as a verb. Glass can mean a clear solid, a container for drinking, the quantity a drinking container will hold, etc. WSD is the process of selecting the correct sense of a word when that word occurs in a specific context. For example, in the sentence, "the window is made of glass", the correct sense of glass is the first sense, a clear solid.

WordNet::SenseRelate::AllWords extends a word sense disambiguation algorithm described by Pedersen, Banerjee, and Patwardhan [2] by making it disambiguate all words in text. The previous version of the algorithm was intended for lexical sample data, which means that a single word in a context is designated as the target word and is the only word to be disambiguated. By contrast, WordNet::SenseRelate::AllWords will assign a sense to every word known to WordNet that appears in a context.

Prior to execution of the algorithm, we remove any word that is not known to WordNet, and any word that appears in a stoplist. The input to the algorithm is presumed to be a single sentence where non-WordNet words and stoplisted words have been removed. WordNet::SenseRelate::AllWords does not cross sentence boundaries when carrying out disambiguation.

The size of the context window can be specified by the user. A context window of size 3 means that the context window will consist of three words, including the target word. Thus, the three words would be the word to the left of the target word, the target word itself, and the word to the right of the target word. The algorithm will expand the context window so that all three words are known to WordNet (the algorithm is unable to disambiguate words unknown to WordNet). For example, if the word 'the' occurs in the context window to the left of the target word, then the window will be expanded by one word to the left.

If the window size is an even number, then there will be one more word to the left of the target word than to the right. For example, if the window size is 4, there will be two words to the left of the target word and one word to the right.

Note that the context window will only include words in the same sentence as the target word. If, for example, the target word is the first word in the sentence, then there will be no words to the left of the target word in the context window regardless of the specified window size.
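The window rules above can be sketched as follows. This is an illustrative Python sketch, not the module's Perl implementation. The function name context_window and the is_known predicate are hypothetical stand-ins (the latter for a WordNet and stoplist lookup). The sketch also assumes that a side cut short by the sentence boundary is not compensated by taking extra words from the other side, which the text above does not specify.

```python
def context_window(words, target, size, is_known=lambda w: True):
    """Return (left, target_word, right) for one sentence.

    words    -- list of words in a single sentence
    target   -- index of the target word in words
    size     -- window size, counting the target word itself
    is_known -- hypothetical predicate: True if a word may be used as context
    """
    if size < 2:
        raise ValueError("window size must be at least 2")

    # An even size puts the extra context word on the left.
    left_quota = size // 2
    right_quota = (size - 1) // 2

    # Walk left from the target, skipping unusable words (window expansion).
    left = []
    i = target - 1
    while i >= 0 and len(left) < left_quota:
        if is_known(words[i]):
            left.append(words[i])
        i -= 1
    left.reverse()

    # Walk right from the target in the same way.
    right = []
    i = target + 1
    while i < len(words) and len(right) < right_quota:
        if is_known(words[i]):
            right.append(words[i])
        i += 1

    return left, words[target], right
```

With the sentence ['a', 'b', 'c', 'd', 'e'] and target index 2, a window of size 4 yields two words on the left and one on the right. With target index 0 the left side is empty, whatever the window size.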

The minimum window size is 2 because a smaller window means that there are no context words in the window. When the window size is 2, there is no context to use for disambiguating the first word in a sentence. To assign a sense number to that first word, the first sense of the word is chosen (i.e., sense number 1). Sense number 1 is usually the most frequent sense of a word.
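The first-sense fallback can be illustrated with a small sketch. This is a hedged Python illustration and not the module's code. The function choose_sense and the relatedness callback are hypothetical. The scoring rule (sum, over context words, of the best relatedness to any sense of that word) and the choice of sense 1 when every score is zero are assumptions made for the example. Only the no-context fallback comes from the text above.

```python
def choose_sense(target_senses, context_senses, relatedness):
    """Pick one sense for a target word.

    target_senses  -- senses of the target, ordered so that sense 1 is first
    context_senses -- one list of senses per context word (may be empty)
    relatedness    -- hypothetical callback returning a non-negative score
    """
    # No context words (e.g. first word of a sentence, window size 2):
    # fall back to sense number 1.
    if not context_senses:
        return target_senses[0]

    best_sense, best_score = target_senses[0], 0.0
    for sense in target_senses:
        # Assumed scoring: for each context word, take its most related sense.
        score = sum(
            max((relatedness(sense, other) for other in senses), default=0.0)
            for senses in context_senses
        )
        # Strict comparison, so ties keep the lower sense number.
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense
```

Because the comparison is strict and sense 1 is the starting candidate, this sketch also returns sense 1 when no sense scores above zero.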

Certain measures of semantic similarity only work on noun-noun or verb-verb pairs; therefore, the usefulness of these measures for WSD is somewhat limited. As a way of coping with this problem, WordNet::SenseRelate::AllWords provides an option to "coerce" words in the context window to be of the same part of speech as the target word.

When POS coercion is in effect, if the target word is a noun, then WordNet::SenseRelate::AllWords will attempt to convert non-nouns in the context window to noun forms of the same word. For example, if the target word is a noun and the verb love occurs in the window, the module might convert that word to the noun love.

WordNet::SenseRelate::AllWords first uses the validForms method from WordNet::QueryData to find any valid forms of the word being coerced that are of the desired part of speech. In the case of part of speech tagged text, the POS tags are discarded. If validForms did not return any forms of the desired part of speech, then the derived forms relation in WordNet is used to find possible forms of the word. If neither of these methods returned usable forms, then no further attempt is made to coerce the word to be the desired part of speech.
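The two-step lookup described above might be sketched like this. The Python below is only an illustration of the fallback order. The valid_forms and derived_forms callbacks are hypothetical stand-ins for the WordNet::QueryData validForms method and the WordNet derived forms relation. The sketch assumes both return strings of the form word#pos and that any tag on the input word uses the same form.

```python
def coerce(word, target_pos, valid_forms, derived_forms):
    """Return forms of word that have part of speech target_pos.

    valid_forms, derived_forms -- hypothetical callbacks that take a bare
    word and return a list of 'word#pos' strings.
    An empty list means the word could not be coerced.
    """
    # Any POS tag already on the word is discarded before the lookup.
    base = word.split("#", 1)[0]
    suffix = "#" + target_pos

    # Try valid forms first, then fall back to derived forms.
    for lookup in (valid_forms, derived_forms):
        forms = [form for form in lookup(base) if form.endswith(suffix)]
        if forms:
            return forms

    # Neither source produced a usable form; no further attempt is made.
    return []
```

For instance, with toy lookup tables in which the valid forms of love are love#n and love#v, coercing love#v to a noun returns love#n without consulting the derived forms at all.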

 1  Show the context window for each pass through the algorithm.
 2  Display winning score for each pass (i.e., for each target word).
 4  Display the non-zero scores for each sense of each target word (overrides 2).
 8  Display the non-zero values from the semantic relatedness measures.
16  Show the zero values as well when combined with either 4 or 8. When not used with 4 or 8, this has no effect.
32  Display traces from the semantic relatedness module.

Different trace levels can be combined to achieve the desired behavior. For example, by specifying a trace level of 3, both level 1 and level 2 traces are generated (i.e., the context window will be shown along with the winning score for each pass).
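Since each trace level is a distinct power of two, a combined level can be read as a bit mask. The Python sketch below shows how a combined value breaks down into individual levels. The short descriptions and the decode_trace function are made up for the illustration and are not part of the package.

```python
# Trace levels from the list above, with abbreviated descriptions.
TRACE_LEVELS = {
    1: "context window",
    2: "winning score",
    4: "non-zero sense scores",
    8: "non-zero relatedness values",
    16: "zero values",
    32: "relatedness module traces",
}


def decode_trace(level):
    """List the individual trace levels switched on in a combined level."""
    return [name for bit, name in TRACE_LEVELS.items() if level & bit]
```

Here decode_trace(3) gives the context window and winning score traces, matching the example above, and decode_trace(20) corresponds to level 4 combined with level 16.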

The output of wsd.pl is simply the disambiguated words. The output will be in the form word#part_of_speech#sense_number. The part of speech will be one of 'n' for noun, 'v' for verb, 'a' for adjective, or 'r' for adverb. Words from other parts of speech are not disambiguated and are not found in WordNet. The sense number will be a WordNet sense number. WordNet sense numbers are assigned by frequency, so sense 1 of a word is more common than sense 2, etc.
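If you post-process the output, each item can be split into its three fields. The following Python sketch shows one way to do that. The function parse_sense is hypothetical and is not shipped with the package. It splits from the right so that only the last two '#' characters act as separators.

```python
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}


def parse_sense(token):
    """Split 'word#part_of_speech#sense_number' into (word, pos, sense)."""
    # rsplit raises ValueError on unpacking if a field is missing.
    word, pos, sense = token.rsplit("#", 2)
    if pos not in POS_NAMES:
        raise ValueError("unexpected part of speech: " + pos)
    return word, pos, int(sense)
```

For example, parse_sense("glass#n#1") yields the word glass, the part of speech n, and the sense number 1.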

Sometimes when a word is disambiguated, a "different" but synonymous word will be found in the output. This is not a bug but is a consequence of how WordNet works. The word sense returned will always be the first word sense in a synset (synonym set) to which the original word belongs.

Raw text that is not part of speech tagged. This text should be formatted so that there is one sentence per line, one line per sentence. For example:

Red cars are faster than white cars.
However, white cars are less expensive.

Except for a few cases, punctuation will be ignored and will be replaced with a space character. The compounds known to WordNet will be identified automatically (e.g., 'winston churchill' will be recognized as a compound and converted to winston_churchill).

All characters other than the characters from the set {\s,-,a-z,A-Z,0-9,_,',\n} will be removed from words except for the user identified compound words. For example, consider the following sentence:

St._Petersburg is a city in Pinellas County, Florida, United States.

In this sentence, the period from the word St._Petersburg will not be removed. However, the period from 'United States.' will be removed. If the user identified compound occurs at the end of the sentence, for example 'St._Petersburg.', then the period at the end of the compound is considered as the end of sentence marker and will be removed.

Punctuation will not be removed from the words 'i.e.' and 'et_al.'. If the punctuation were removed, 'i.e.' would be treated as two separate words, 'i' and 'e'. That is not the desired behavior, because i.e. and et_al. are defined in WordNet. We include these two words because we found them in SemCor 3.0 while running our experiments. If your text has words like this, you can add them to the wsd.pl code.
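The punctuation rules above can be approximated as follows. This is a simplified Python sketch and not the code in wsd.pl. The names KEEP_AS_IS, DISALLOWED, and clean_sentence are hypothetical. The sketch assumes that any token containing an underscore is a user identified compound, and it does not perform the automatic compound detection described earlier. It also does not handle an exception word or a compound that is directly followed by a comma.

```python
import re

# Words whose punctuation is preserved, as listed in the text above.
KEEP_AS_IS = {"i.e.", "et_al."}

# Anything outside the allowed set {\s,-,a-z,A-Z,0-9,_,',\n}.
DISALLOWED = re.compile(r"[^\s\-a-zA-Z0-9_']")


def clean_sentence(sentence):
    """Clean one raw sentence according to the rules sketched above."""
    tokens = sentence.split()
    cleaned = []
    for position, token in enumerate(tokens):
        is_last = position == len(tokens) - 1
        if token in KEEP_AS_IS:
            cleaned.append(token)
        elif "_" in token:
            # Assumed user identified compound: keep internal punctuation,
            # but treat a sentence-final period as the end of sentence marker.
            if is_last and token.endswith("."):
                token = token[:-1]
            cleaned.append(token)
        else:
            # Replace every disallowed character with a space.
            cleaned.append(DISALLOWED.sub(" ", token))
    # Collapse the runs of whitespace left behind by the replacements.
    return " ".join(" ".join(cleaned).split())
```

Applied to the example sentence above, the sketch keeps the period inside St._Petersburg and drops the commas and the final period.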

Words that are not tagged will be ignored even if they are known to WordNet. Punctuation is ignored. Compounds will not be automatically identified; they must be specified by the user (e.g., winston_churchill/NNP, red_tape/NNS).
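As an illustration of how tagged input might be read, the Python sketch below splits each token at its last slash and skips untagged words. It is not the package's parser. The function parse_tagged is hypothetical, and the mapping from tag prefixes to WordNet parts of speech is an assumption based on the Penn Treebank style tags used in the examples above (NNP, NNS).

```python
# Assumed mapping from Penn Treebank tag prefixes to WordNet POS letters.
PENN_PREFIX_TO_WORDNET = {"NN": "n", "VB": "v", "JJ": "a", "RB": "r"}


def parse_tagged(sentence):
    """Return (word, wordnet_pos) pairs for the tagged words in a sentence."""
    pairs = []
    for token in sentence.split():
        word, separator, tag = token.rpartition("/")
        # Untagged words are ignored, even if WordNet knows them.
        if not separator or not word or not tag:
            continue
        pos = PENN_PREFIX_TO_WORDNET.get(tag[:2])
        # Tags outside the four WordNet parts of speech are dropped.
        if pos is not None:
            pairs.append((word, pos))
    return pairs
```

Compounds pass through unchanged because the underscore is part of the word, so winston_churchill/NNP becomes the pair (winston_churchill, n).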

Identical to tagged text, except that the part of speech tags are from the WordNet tag set, which limits them to 'n', 'v', 'a', or 'r', for nouns, verbs, adjectives and adverbs. This text should be formatted so that there is one sentence per line, one line per sentence. For example:

Words that are not tagged will be ignored even if they are known to WordNet. Punctuation is ignored. Compounds will not be automatically identified; they must be specified by the user (e.g., winston_churchill#n, red_tape#n).

Additionally, no attempt will be made to search for other valid forms of the words in the input. For example, if 'dogs#n' is in the input, the program will not attempt to use other forms such as 'dog#n'.
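A reader for this format could look like the sketch below. It is a hedged Python illustration, and the function parse_wntagged is hypothetical. It keeps each word exactly as written, so 'dogs#n' stays 'dogs', and it skips tokens that carry no tag or a tag outside 'n', 'v', 'a', and 'r'.

```python
WORDNET_POS = {"n", "v", "a", "r"}


def parse_wntagged(sentence):
    """Return (word, pos) pairs from text tagged with WordNet POS letters."""
    pairs = []
    for token in sentence.split():
        word, separator, pos = token.rpartition("#")
        # Keep only tokens with a word, a '#', and a WordNet POS letter.
        # The word is stored verbatim; no other forms are looked up.
        if separator and word and pos in WORDNET_POS:
            pairs.append((word, pos))
    return pairs
```

Given the input "dogs#n chase#v cats#n in the park", only the three tagged words are returned.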

The different options and parameters for wsd.pl are discussed in detail in the documentation for wsd.pl. Run 'perldoc wsd.pl' to view the documentation.

The context parameter to disambiguate() specifies a set of words to disambiguate. The function treats the context as one sentence. To disambiguate multiple sentences, make a call to disambiguate() for each sentence.

The usage of the disambiguation module is discussed in detail in the documentation for the module. Run 'perldoc WordNet::SenseRelate::AllWords' or 'man WordNet::SenseRelate::AllWords' (after installing the module) to view the documentation. To view the documentation before installing the module, run 'perldoc lib/WordNet/SenseRelate/AllWords.pm'.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.