Report of the Indo-Wordnet Workshop, 14-16 January, 2003
--------------------------------------------------------
Recognizing the immense importance of lexical resources, the Central Institute of Indian Languages (CIIL) Mysore and the Indian Institute of Technology (IIT) Bombay jointly organized the Indian languages wordnet workshop from the 14th to the 16th of January, 2003. The objective of the workshop was to explore methodologies for constructing wordnets for Indian languages and then linking them internally to produce the Indo-wordnet, which would eventually be linked to the English wordnet and the Euro-Wordnet (a conglomeration of European languages' wordnets). It is now widely accepted that no meaningful research and development in language processing, information extraction and machine translation can be carried out without wordnets. In India, wordnet building activities are under way for Hindi and Marathi at IIT Bombay, Tamil at Anna University Knowledge Based Center (AU-KBC) Chennai and Tamil University Tanjavur, Gujarati at MS University Baroda, Oriya at Utkal University Bhubaneswar and Bengali at IIT Kharagpur. The Hindi wordnet is at an advanced stage of development, with about 11000 semantically linked synsets along with the associated software and user interface.
On the first day, the Director of CIIL, Prof. Uday Narayan Singh, welcomed the participants representing all the major languages of India. Prof. Singh stressed the need for utilizing the enormous amount of linguistic work in the country for the purpose of wordnet building. He strongly recommended the setting up of a website where related information, software and resources would be kept in a browsable and freely downloadable form. Dr. Jayaram of CIIL and Dr. Pushpak Bhattacharyya of IIT Bombay explained the goal of the workshop and described the milestones that are expected to be achieved in a year's time.
After this, Dr. Bhattacharyya delivered a day-long tutorial on the fundamentals, methodologies and applications of the wordnet. He was assisted by the IIT Bombay wordnet group members attending the workshop: Mr. Prabhakar Pande, Mrs. Sraddha Mahapurush and Mrs. Bapat. The first wordnet of the country, the Hindi wordnet, is being built at IIT Bombay. The concepts of (i) Synsets, (ii) Semantic Relations and (iii) the Interface of the wordnet were explained. Since synsets are the building blocks of the wordnet, a considerable amount of time was spent on describing the structure, principle of creation and associated parts of a synset. It was repeatedly stressed that individual words may be polysemous, but when more than one synonymous word is put together, a unique meaning emerges. For example, the synset {ghar, griha, makaan} denotes the unique sense of "residence". This sense is attached as a "gloss" and is exemplified by a simple sentence. For example, in the synset
{ghar, griha, makaan, aalay, sadan}, 'manushya kaa aavaassthal', "raam kaa ghar mandir ke paas hai".
-synset- -gloss- -example
the gloss and the example are shown as above. The gloss plays a very important role in the wordnet since it is through the gloss that synsets are linked across wordnets. Thus, in the Indo-wordnet, the language-specific wordnets are expected to have identical glosses and examples as far as possible. The advantage of this is the possibility of creating multiway parallel corpora.
Dr. Bhattacharyya stressed that the synsets should be constructed abiding by the three principles of
(i) Minimality (the minimal set of words to make the concept unique)
(ii) Coverage (the maximal set of words, ordered by frequency in the corpus, to include all possible words
standing for the sense)
(iii) Replaceability (the example sentence should be such that the most frequent words in the synset can
replace one another in the sentence without altering the sense)
In the above example, {ghar, makaan} is the minimal set, the rest of the words cover the concept and the words "ghar, griha" and "makaan" can replace one another in the example sentences with minor changes to the sentence structure.
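The synset structure and the three principles can be illustrated with a minimal Python sketch. The class and field names (Synset, words, minimal, gloss, example, substitutions) are assumptions made for illustration and do not describe the storage format of the Hindi wordnet. The substitution helper only swaps words mechanically; it does not make the "minor changes to the sentence structure" that a human lexicographer might need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    words: tuple        # every member of the synset (coverage)
    minimal: frozenset  # smallest subset that makes the sense unique (minimality)
    gloss: str
    example: str

    def substitutions(self, candidates):
        """Swap each candidate into the single slot of the example sentence
        that currently holds a synset member (an aid for checking replaceability)."""
        tokens = self.example.split()
        slots = [i for i, tok in enumerate(tokens) if tok in self.words]
        if len(slots) != 1:
            raise ValueError("example must contain exactly one synset member")
        variants = {}
        for word in candidates:
            if word not in self.words:
                raise ValueError(f"{word!r} is not a member of this synset")
            swapped = list(tokens)
            swapped[slots[0]] = word
            variants[word] = " ".join(swapped)
        return variants


# Data taken from the example in the report.
residence = Synset(
    words=("ghar", "griha", "makaan", "aalay", "sadan"),
    minimal=frozenset({"ghar", "makaan"}),
    gloss="manushya kaa aavaassthal",
    example="raam kaa ghar mandir ke paas hai",
)

for word, sentence in residence.substitutions(["ghar", "griha", "makaan"]).items():
    print(word, "->", sentence)
```

A lexicographer would still have to read each generated sentence and judge whether the sense is preserved; the code only produces the candidates.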
The semantic relations of "hyperonymy/hyponymy", "meronymy/holonymy", "antonymy", "gradations", "entailment" and "troponymy" were explained with examples; so was the importance of cross parts of speech linkages and the connection between the wordnet and the ontology.
The next day, Mr. Nitin Verma of IIT Bombay demonstrated the application of the wordnet in automatically creating document-specific dictionaries. The words obtain their disambiguators and semantic attributes from the wordnet. Dr. Rajendran of Tamil University, Tanjavur described their effort on the construction of the Tamil wordnet. Using an ontology motivated by Nida's concept classification, the Tamil wordnet is created through the following steps: (i) extraction of words from the dictionary, (ii) grouping of words into domains and sub-domains and (iii) arranging the groups hierarchically. Dr. Sudeshna Sarkar of IIT Kharagpur described their work on the Bengali wordnet and placed it in the context of their other language processing activities.
After this, all the participating language groups exchanged notes on an exercise done on a 100-synset sample from the Hindi wordnet, which IIT Bombay had provided to all the language groups three months prior to the workshop. These 100 synsets cover all the parts of speech and were drawn from major conceptual categories like natural object, action, quality etc. It was interesting to observe how words assume different shades across languages, how glosses become tricky to create for commonly used terms, how words prefer collocations, and how example sentences are often directly adaptable with minimal changes from one language to another "close" language. This was an extremely educative experience which clarified the methodology of wordnet construction. While this exercise was going on, it was realized that the following needed discussion:
(a) The ontology behind the wordnet.
(b) Componential approach to the construction of the wordnet.
(c) Culture specific considerations in the wordnet.
(d) Specialities of Indian verbs.
The above discussions were to be led by Dr. Rajendran of Tamil University Tanjavur, Dr. Uma Maheswar Rao of the University of Hyderabad, Dr. Lalita Handoo of CIIL Mysore and Dr. J.C. Sharma of CIIL Mysore respectively.
On the last day of the workshop, discussions continued around the sample synsets. Dr. Bhattacharyya emphasised that the glosses in the wordnet explicate the synset senses, but cannot really be encyclopedic, scientific or legal definitions. In explicating the senses, they are assisted by the members of the synset and also by the accompanying example sentences. Since the gloss is used for linking and creating the synsets, it was decided that
(a) the glosses will be short and simple.
(b) they will be expressed both in the specific language and in English.
(c) the example sentences also will be simple and precise; idiomatic and poetic expressions will be avoided.
It was highlighted again that the glosses and the example sentences will give rise to multiway parallel corpora.
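How shared glosses could yield parallel corpora can be sketched in Python as follows. This is an illustration under stated assumptions, not the workshop's agreed procedure: the tuple layout, the function name align_by_gloss, the English gloss "dwelling place of humans" and the English example sentence are all assumed renderings of the Hindi entry quoted earlier in this report.

```python
from collections import defaultdict

# Each entry: (language, shared English gloss, gloss in the language, example sentence).
# The Hindi row comes from the report; the English row is an assumed translation.
entries = [
    ("hindi", "dwelling place of humans",
     "manushya kaa aavaassthal", "raam kaa ghar mandir ke paas hai"),
    ("english", "dwelling place of humans",
     "dwelling place of humans", "Ram's house is near the temple"),
]


def align_by_gloss(entries):
    """Group example sentences by their shared English gloss.

    Each resulting row maps language -> example sentence, so one row is
    one multiway-aligned sentence group."""
    aligned = defaultdict(dict)
    for language, english_gloss, _native_gloss, example in entries:
        aligned[english_gloss][language] = example
    return dict(aligned)


for gloss, row in align_by_gloss(entries).items():
    print(gloss)
    for language, sentence in row.items():
        print("  ", language, ":", sentence)
```

Adding a further language only means adding rows with the same English gloss, which is why the decision to keep an English gloss alongside the language-specific one matters for linking.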
Following this, Dr. Rajendran described their work on ontology. The top ontological categories are "things", "events", "abstracts" and "relationals", which correspond to "concrete nouns", "verbs", "adjectives and abstract nouns" and "postpositions and case markers" respectively. The details of the ontological categories at various levels were discussed. Dr. Uma Maheswar Rao described the componential approach to wordnet creation. Introducing the interesting notion that "words are bundles of semantic features which are binary and parallel to those in phonetics", Dr. Rao proposed that a space of semantic features be designed and that the words sharing ALL and ONLY a set of common semantic features be inserted into the same synset. He gave examples to illustrate this idea. Dr. Bhattacharyya suggested that this highly interesting approach be worked out in detail, especially for the "abstracts". He observed that the features, once worked out, can be attached to the synsets of existing wordnets.
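The "ALL and ONLY" criterion of the componential approach can be sketched in Python as grouping words by identical binary feature vectors. The feature names (concrete, dwelling, animate) and the values assigned to each word are invented for illustration; they are not Dr. Rao's feature space, which the report does not specify.

```python
from collections import defaultdict

# Hypothetical binary feature space; every word is specified for every feature.
lexicon = {
    "ghar":     {"concrete": True, "dwelling": True,  "animate": False},
    "makaan":   {"concrete": True, "dwelling": True,  "animate": False},
    "griha":    {"concrete": True, "dwelling": True,  "animate": False},
    "mandir":   {"concrete": True, "dwelling": False, "animate": False},
    "manushya": {"concrete": True, "dwelling": False, "animate": True},
}


def group_by_features(lexicon):
    """Place words in the same group exactly when their feature bundles
    are identical, i.e. they share ALL and ONLY the same feature values."""
    groups = defaultdict(list)
    for word, features in lexicon.items():
        groups[frozenset(features.items())].append(word)
    return [sorted(words) for words in groups.values()]


for synset in group_by_features(lexicon):
    print(synset)
```

With a coarse feature space, words that are not true synonyms would collapse into one group, so the quality of the synsets depends on how finely the feature space is designed.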
Verbs in Indian languages show some unique features like (i) conjunct verbs, (ii) compound verbs, (iii) causative formation, (iv) pairings and (v) onomatopoeia. Dr. J.C. Sharma of CIIL Mysore explicated these issues with clarity. He described the tests for conjunct verbs (nominal + verb) and compound verbs (verb + verb, with the second verb serving as the vector/explicator/intensifier). Not every nominal and verb combination qualifies as a conjunct verb; the whole unit must behave like a simple verb and the agreement must take place with entities OUTSIDE the combination. Dr. Bhattacharyya brought up the computational issue of storing the verbs in the wordnet. It was decided that
(a) Conjunct verbs will be lexicalised in the wordnet.
(b) Compounds and all the other phenomena will be dealt with by a separate morphological module serving as
the front end to the wordnet.
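The division of labour decided above can be sketched in Python as a lookup that first tries the lexicon directly and only then falls back to a front-end step that strips the vector verb. Everything in this sketch is an assumption for illustration: the Hindi forms "kaam karnaa" (a conjunct verb) and "gir jaanaa" (a compound verb with the vector "jaanaa") are standard textbook examples rather than items from the workshop, and adding "naa" to the stem is a crude stand-in for a real morphological module.

```python
# Toy lexicon: simple verbs plus one lexicalised conjunct verb (nominal + verb).
LEXICON = {"kaam karnaa", "girnaa", "karnaa", "jaanaa"}

# Small illustrative set of vector (explicator) verbs.
VECTOR_VERBS = {"jaanaa", "denaa", "lenaa"}


def lookup_form(phrase):
    """Return the lexicon entry to consult for a verb phrase, or None.

    1. Conjunct verbs are stored whole, so an exact match wins.
    2. Otherwise a two-word 'stem + vector verb' compound is reduced to
       the infinitive of its main verb by the front-end step."""
    phrase = " ".join(phrase.split())  # normalise whitespace
    if phrase in LEXICON:
        return phrase
    parts = phrase.split()
    if len(parts) == 2 and parts[1] in VECTOR_VERBS:
        candidate = parts[0] + "naa"  # naive stem -> infinitive (assumption)
        if candidate in LEXICON:
            return candidate
    return None


print(lookup_form("kaam karnaa"))  # found directly as a lexicalised conjunct verb
print(lookup_form("gir jaanaa"))   # reduced to the main verb by the front end
```

Keeping compounds out of the lexicon keeps the wordnet smaller, at the cost of relying on the front end to recover the main verb correctly.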
Dr. Lalita Handoo showed with examples, especially from Kashmiri, how very culture-specific concepts do not have parallels in other languages. How they are to be linked with the synsets of other languages remains an open question. Dr. Bhattacharyya suggested that a viable approach could be to link them indirectly through the hyperonymy relation. Dr. Rajasri of CIIL said that the concepts could be classified as (i) universals across the world's languages, (ii) universals across Indian languages and (iii) those specific to individual languages. Initially the groups should concentrate on (i) and link with one another.
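Indirect linking through hyperonymy can be sketched in Python as a walk up the hypernym chain until a synset with a cross-language link is found. The synset identifiers, the two dictionaries and the function name are placeholders invented for illustration; the report does not specify how such links would be stored.

```python
def nearest_linked_ancestor(synset_id, hypernym_of, cross_links):
    """Climb the hypernym chain from synset_id and return the first
    (synset, linked synset in the other language) pair found, else None."""
    seen = set()
    current = synset_id
    while current is not None and current not in seen:
        if current in cross_links:
            return current, cross_links[current]
        seen.add(current)  # guards against accidental cycles in the data
        current = hypernym_of.get(current)
    return None


# Placeholder data: a culture-specific concept with no direct counterpart.
hypernym_of = {
    "kashmiri:culture_specific_concept": "kashmiri:general_concept",
}
cross_links = {
    "kashmiri:general_concept": "hindi:general_concept",
}

print(nearest_linked_ancestor(
    "kashmiri:culture_specific_concept", hypernym_of, cross_links))
```

The link obtained this way is deliberately less specific than a direct synset-to-synset link, which is the trade-off of the indirect approach.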
At the end of the workshop the following resolutions were adopted:
1. By the end of 2003, a wordnet of 5000 synsets will be created for each Indian language. These will cover
about 2000 of the most frequent content words in each language. Use will be made of the wordlist sorted by
frequency, available with the CIIL.
2. The language-specific wordnets are being, or will be, developed by the following institutions:
CIIL Mysore: Kannada, Kashmiri, Punjabi, Urdu, Himachali
IIT Bombay: Hindi, Marathi and Konkani (in collaboration with the Goa research group for the
last mentioned)
AU-KBC Chennai and Tamil University Tanjavur: Tamil and Malayalam
University of Hyderabad: Telugu
IIT Kharagpur: Bengali
University of Baroda: Gujarati
Utkal University Bhubaneswar: Oriya
Research groups have to be identified for building the wordnets of Assamese, Nepali and the languages of the
North East.
3. Funds have to be generated for constructing the Integrated Indo-Wordnet at the national level.