2
N. Calzolari, Dottorato, Pisa, Maggio 2009

Why were such needed LRs still lacking after 30 years of R&D in the field?
(An old slide with Antonio Zampolli, 80s/early 90s: why do we still lack them?)

1) Because the main trend until the mid-80s was to privilege the processing of critical phenomena, studied by the dominant linguistic theories, rather than focusing on the deep analysis of the real uses of a language.
As a result, CL was focusing on:
- few examples, often artificially built
- lexicons made of few entries (toy lexicons)
- grammars with poor coverage
2) Because large-scale LRs are costly and their production requires a big organizing effort.

3
Work on Machine Readable Dictionaries: the beginnings…
Historical notes: pioneering research
After many years of complete disregard, or even disdain and contempt, for LRs, due mainly to the prevalence and influence of the generativist school.
Don Walker & Antonio Zampolli.
Early interest:
- to make dictionaries machine-tractable
- to extract information from them, with much less powerful tools than now: a precursor of the trend of automatic acquisition from corpora
- Acquilex (Pisa et al.)
- work on/with the Longman dictionary (Las Cruces)
- NSF & EC International Cooperation grant, promoted by Wilks, Zampolli, Calzolari (Las Cruces & Pisa)

4
… back from the 70s/80s
Automatic acquisition of lexical information from MRDs was at the centre of activities in the Pisa group, Amsler, Briscoe, Boguraev, Wilks' group, IBM, then Japanese groups, …
The trend was: large-scale computational methods for the transformation of machine readable dictionaries (MRDs) into machine tractable dictionaries.
It became evident that part of the results of meaning extraction, e.g. many meaning distinctions which could be generalised over lexicographic definitions and automatically captured, were unmanageable at the formal representation level and had to be blurred into unique features and values.
Unfortunately, it is still difficult today to constrain word meanings within a rigorously defined organization: by their very nature they tend to evade any strict boundaries.

5
Data-driven approaches
After that pioneering era, production & use of adequate LRs strongly increased:
- The lexicon has become ever more relevant.
- Both international and national authorities started investing in the field as never before, interested in technologies & systems that really work and are economically interesting.
- The need for empirical methods, based on the analysis of large amounts of data, has been recognized.
- LRs must be robust enough to analyse the concrete uses of a language, whether theoretically interesting or not.

6
Since then …
LRs have acquired larger resonance in the last two decades, when many activities, in Europe and world-wide, have contributed to substantial advances in the knowledge and capability of how to represent, create, acquire, access, exploit, harmonise, tune, maintain, distribute, etc. large lexical and textual repositories.
In Europe an essential role was played by the EC, through initiatives such as NERC, PAROLE, SIMPLE, EuroWordNet, EAGLES, ISLE, ELSNET, RELATOR, … that saw the participation of many EU groups, linked over the years by shared approaches and visions.

7
… back from the late 80s
After acquisition from MRDs: automatic acquisition of information from texts.
This trend has become today a consolidated fact, and we have moved from focusing on the acquisition of linguistic information (as at the beginning) to the broader acquisition of general knowledge, with more data-intensive, robust, reliable methods.

8
LRs as necessary infrastructure (Lexicons/Corpora) for both research & applications
We started building LRs that give NLP systems the knowledge needed for the various kinds of linguistic processing, realising that most of the needed information:
- escapes individual introspection
- can only be acquired by analysing large textual corpora attesting language use in different fields/communicative contexts
BUT there is a need for adequate models to handle the actual usage of language.
Sub-product?: the importance of statistical methods.
Lesson: going from core sets to large coverage has implications not just in quantitative terms, but more interestingly in terms of changes to the models and the strategies of processing.

16
EWN/IWN Ontology Top Nodes
1stOrderEntity: any concrete entity (publicly) perceivable by the senses and located at any point in time, in a three-dimensional space.
2ndOrderEntity: any Static Situation (property, relation) or Dynamic Situation, which cannot be grasped, heard, seen, or felt as an independent physical thing. They can be located in time and occur or take place rather than exist; e.g. continue, occur, apply.
3rdOrderEntity: an unobservable proposition which exists independently of time and space. They can be true or false rather than real. They can be asserted or denied, remembered or forgotten; e.g. idea, thought, information, theory, plan.
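The three top nodes above can be sketched as a tiny classification in Python. A minimal sketch, assuming a toy lexicon: the enum values follow the slide's definitions, while the example entries and the `top_node` function are illustrative, not part of the actual EWN/IWN resources.

```python
from enum import Enum
from typing import Optional

class TopNode(Enum):
    """EWN/IWN top-level entity orders, following the slide's definitions."""
    FIRST_ORDER = "1stOrderEntity"    # concrete, perceivable, located in space and time
    SECOND_ORDER = "2ndOrderEntity"   # situations that occur/take place rather than exist
    THIRD_ORDER = "3rdOrderEntity"    # propositions that are true/false rather than real

# Hypothetical toy lexicon: a few word senses mapped to their top node
TOY_SENSES = {
    "dog": TopNode.FIRST_ORDER,
    "occur": TopNode.SECOND_ORDER,
    "continue": TopNode.SECOND_ORDER,
    "theory": TopNode.THIRD_ORDER,
    "plan": TopNode.THIRD_ORDER,
}

def top_node(word: str) -> Optional[TopNode]:
    """Return the ontological order of a word in the toy lexicon, if known."""
    return TOY_SENSES.get(word)
```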

36
The SIMPLE Ontology (from Nilda Ruimy)
The SIMPLE ontology is a multidimensional type hierarchy based on both hierarchical and non-hierarchical conceptual relations.
In the SIMPLE ontology, types are not mere labels but the repository of a specific set of structured semantic information.

38
Ontology of Structured Semantic Types: a Template
A schema providing a set of structured information crucial to the definition of a semantic type:
- interface between ontology & lexicon
- guide for the lexicographer

39
Semantic Type in the SIMPLE Ontology (from Nilda Ruimy)
Not just a label, but rather a classificatory device consisting of a cluster of structured semantic information.
Type assignment means endowing a word sense with a structured set of semantic features and relations with a view to:
- distinguishing it from other senses of the same word
- expressing its similarity with other words
- expressing its relationships to other words
- drawing inferences from this information
Each semantic type is associated to a template, i.e. a schematic structure that contains a cluster of type-defining properties and imposes constraints on lexical items for type membership.
Templates are the interface between Ontology and Lexicon; the template-driven encoding methodology ensures internal and cross-lexicon consistency.
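The idea of a template imposing membership constraints can be sketched as follows. This is a minimal illustration, not SIMPLE's actual encoding format: the field names, the `Instrument` template, and the `knife` entry are my own assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Template:
    """Schematic structure for a semantic type (illustrative fields only)."""
    type_name: str
    parent: Optional[str]         # position in the type hierarchy
    defining_relations: set       # properties a member sense must carry

@dataclass
class WordSense:
    lemma: str
    relations: set = field(default_factory=set)

def satisfies(sense: WordSense, template: Template) -> bool:
    """Template-driven check: does the sense carry all type-defining properties?"""
    return template.defining_relations <= sense.relations

# Hypothetical template for an Instrument-like type and a candidate sense
instrument = Template("Instrument", parent="ConcreteEntity",
                      defining_relations={"telic", "agentive"})
knife = WordSense("knife", {"telic", "agentive", "constitutive"})
```

Encoding against such templates is what gives the consistency guarantee mentioned above: every lexicographer filling in an `Instrument` entry is forced to supply the same cluster of type-defining properties.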

42
Qualia Structure
One of the four levels of semantic representation in the theory of the Generative Lexicon.
Consists of four qualia roles encoding orthogonal dimensions of meaning:
- formal role (general identification)
- constitutive role (composition)
- agentive role (origin)
- telic role (function)
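The four roles can be written down as a simple record. The entry below is in the spirit of Pustejovsky's well-known 'novel' illustration; the exact values here are my paraphrase, not taken from any actual lexicon.

```python
from dataclasses import dataclass

@dataclass
class QualiaStructure:
    """The four qualia roles of the Generative Lexicon, one per dimension."""
    formal: str        # general identification: what kind of thing is it?
    constitutive: str  # composition: what is it made of?
    agentive: str      # origin: how did it come into being?
    telic: str         # function: what is it for?

# Illustrative entry for 'novel' (values are a paraphrase, for exposition)
novel = QualiaStructure(
    formal="book",
    constitutive="narrative",
    agentive="write",
    telic="read",
)
```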

60
Ontology & Lexicon, Lexicon & Corpus
Today we can easily say that ontology learning, i.e. the practical feasibility of supporting knowledge acquisition in a domain, depends on developing automatic methods for acquiring conceptual representations from natural-language text.
Semantic Web initiatives are also focusing on the building of ontological representations from texts, and in this respect show a large amount of conceptual overlap with the notion of a dynamic lexicon.
Based on various experiences, and as a work strategy for lexical/textual resources, we should push towards innovative types of lexicons: a sort of example-based living lexicon that shares the properties of both lexicons and corpora.
In such a lexicon, redundancy is not a problem but rather a benefit.

61
Mismatch between LRs and LT
There is often a gap between advancement in LRs and in LT: either adequate LRs are missing, or there are no systems able to use knowledge-intensive LRs effectively.
Shortcomings:
- lack of usable implementations fully exploiting new types of LRs
- LR claims are not empirically evaluated
BUT… a parallel evolution of R&D for both LRs and LT is needed.

65
… Dynamic lexicons
Current computational lexicons (even WordNets) are static objects, still shaped on traditional dictionaries.
BUT: towards a flexible model of a dynamic lexicon
- extending the expressiveness of a core static lexicon
- adapting to the requirements of language in use as attested in corpora
- with semantic clustering techniques, etc.
Convert the extreme flexibility & multidimensionality of meaning into large-scale and exploitable (VIRTUAL?) resources: a Lexicon & Corpus together, a sort of example-based lexicon.

67
Complexity of Word Sense in context: many potential clues
A particular meaning (of a verb) may be selected by:
- A specific syntactic pattern:
  comprendere + that-clause = 'to understand' [not 'to include']
  aprire + PP introduced by a (preferably with human head) = 'to be ready, open, well disposed towards someone' (e.g. Cossiga apre a La Malfa)
- The semantic type of subjects, direct objects, indirect objects:
  a human subject (if not of collective type) always selects the meaning 'to understand' of the verb comprendere
- The domain of use:
  perseguire un reato = 'to prosecute a crime' (domain = law)
- A specific modifier:
  perseguire penalmente = 'to prosecute at the penal level', not 'to pursue (a goal)'
  comprendere benissimo = 'to understand very well', not 'to include'
Two different senses of a lemma cannot be selected simultaneously in the same context. BUT…
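The clue types above can be sketched as an ordered rule list for comprendere. A minimal sketch only: the feature dictionary, the feature names, and the priority order among clues are my assumptions for illustration, not a claim about how any actual system encodes them.

```python
# A minimal clue-based sense selector for the verb "comprendere",
# sketching the slide's clue types; the feature encoding is hypothetical.
def sense_of_comprendere(features: dict) -> str:
    """Return 'understand' or 'include' from contextual clues."""
    # Specific syntactic pattern: comprendere + that-clause = 'to understand'
    if features.get("complement") == "that-clause":
        return "understand"
    # Specific modifier: comprendere benissimo = 'to understand very well'
    if features.get("modifier") == "benissimo":
        return "understand"
    # Semantic type of the subject: a human (non-collective) subject
    # always selects 'to understand'
    if features.get("subject_type") == "human" and not features.get("collective"):
        return "understand"
    # Default when no discriminating clue is present
    return "include"
```

For example, a clause like "Maria comprende che…" would carry both the human-subject and the that-clause clues, and either one suffices to select 'to understand'.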

68
Complexity of Word Sense identification
The problem: the tests are not sure; they have only partial validity and are not completely discriminating. Moreover, it is not easy to predict when to apply which test.
Word Sense Disambiguation (WSD) in different contexts is better achieved using information types at different levels of linguistic description: morphosyntactic/syntactic/semantic/pragmatic…, even multilingual.
BUT it is a priori unpredictable where the clue is.

69
Complexity of Word Sense & the use of Corpora
The availability of large quantities of semantically tagged corpora helps to:
- analyse the impact of different clues on performing WSD in different contexts
- study the interaction of clues belonging to different levels of linguistic description, to improve WSD strategies (not just statistics!!)
- automatically acquire syntactic, semantic, collocational (lexical) indicators which can help in the identification of a word sense
List them in the lexicon??
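The last point, acquiring collocational indicators from a sense-tagged corpus, can be sketched as simple per-sense co-occurrence counting. The corpus format, the toy perseguire data, and the frequency threshold are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def collocational_indicators(tagged_corpus, target, min_count=2):
    """
    From sense-tagged sentences, count the words co-occurring with each
    sense of `target` and keep those seen at least `min_count` times.
    Each corpus item is (tokens, sense_of_target): a toy format.
    """
    counts = defaultdict(Counter)
    for tokens, sense in tagged_corpus:
        for tok in tokens:
            if tok != target:
                counts[sense][tok] += 1
    return {sense: {w for w, c in ctr.items() if c >= min_count}
            for sense, ctr in counts.items()}

# Toy sense-tagged corpus for "perseguire" (illustrative, not real data)
corpus = [
    (["perseguire", "reato"], "prosecute"),
    (["perseguire", "reato", "penalmente"], "prosecute"),
    (["perseguire", "obiettivo"], "pursue"),
    (["perseguire", "obiettivo", "comune"], "pursue"),
]
indicators = collocational_indicators(corpus, "perseguire")
```

On this toy data, reato survives as an indicator of the 'prosecute' sense and obiettivo of the 'pursue' sense, which is exactly the kind of lexical clue the slide asks whether to list in the lexicon.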

71
… what cannot be easily encoded at the Lexical-Semantic Level
In a Senseval framework:
- when sense interpretation requires appeal to extra-linguistic knowledge (not to be captured at the lexical-semantic level of description)
- when corpus annotation either diverges from the lexical resource or further specifies it, e.g.:
  - words acquiring a specific sense, strictly dependent on the context: la donna Pauline Collins, che ha già visto arrestare il marito dai tedeschi, … ('the woman Pauline Collins, who has already seen her husband arrested by the Germans, …')
  - a variety of nuances of a verb, e.g. according to the semantic type of the co-occurring direct object
  - metaphors extended to an entire sentence: l'auto verde arriva sul tavolo del governo (lit. 'the green car arrives on the table of the government')
Not all these shifts of meaning can/must be captured through lexical-semantic annotation.

75
Are we dealing with semantic annotation in the right way??
Usual issues: Is there a fixed set of senses? Do senses exist as separate objects?
Criteria for sense distinction are very application-dependent:
- greater vs. lesser granularity depends on the task/domain/situation/etc., i.e. the communication purpose
- & there is no inherently true (upper or lower) limit to the granularity...
A 'checklist theory' of meaning, i.e. meaning as a piece of information with an autonomous status independent of its use, is impossible.
Computational resources should provide:
- multi-dimensional information
- the highest expressiveness in terms of sense-discriminating power
- contextual information

76
Divergences between Lexicon encoding & Corpus annotation (not yet solved)
In the lexicon, senses are de-contextualized (a necessity to capture generalizations); sense discrimination must be kept under control: clustering (manual or automatic).
In the corpus sense-annotation task, contextualization plays a predominant role and calls for a range of pragmatic issues; corpus analysis per se would lead to an excessive granularity of sense distinctions.
Strategy: capture just the core basic distinctions in a core lexicon & acquire additional, more granular information (usually of a collocational nature) from corpora, to be encoded within the broader senses, e.g. to help translation.
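The proposed strategy, keeping broad core senses in the lexicon and folding finer corpus-derived distinctions under them, can be sketched as a simple mapping. The sense inventory and the mapping below are hypothetical examples, invented only to make the grouping step concrete.

```python
# Sketch of folding fine-grained corpus senses under broader core-lexicon
# senses; the sense labels and the mapping are hypothetical examples.
CORE_MAP = {
    "prosecute_crime": "prosecute",   # fine corpus sense -> core lexicon sense
    "prosecute_person": "prosecute",
    "pursue_goal": "pursue",
    "pursue_career": "pursue",
}

def fold_into_core(corpus_annotations):
    """Group corpus sense labels under the broader core senses they refine."""
    grouped = {}
    for fine in corpus_annotations:
        core = CORE_MAP.get(fine, fine)   # unknown labels pass through unchanged
        grouped.setdefault(core, []).append(fine)
    return grouped
```

The finer labels are not discarded: they stay encoded within the broader sense, where they can serve, for instance, as translation-selection hints.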

77
Lesson learned: [in-]adequacy of lexical resources
Between LRs and Linguistics: a consequence of the corpus-based approach
It compels us to break hypotheses too easily taken for granted in mainstream linguistics:
- in actual usage, a characteristic of language is to display many properties which behave as a continuum, not as yes/no properties
- the same holds true for so-called rules: we find tendencies towards a rule more frequently than precise rules
- many of the theoretical rules appear to be simplifications or idealisations, in fact dispelled by real usage
- a number of dichotomies must then be reconciled
There is a long way to go before we can recognise & integrate the many dimensions relevant to content interpretation.