Revision as of 17:24, 3 February 2015

Coordination

Nuno J. Mamede received his graduate, MSc, and PhD degrees in Electrical and Computer Engineering from Instituto Superior Técnico, Lisbon, in 1981, 1985, and 1992, respectively. He started as a lecturer in 1982 and since 2006 has held a position as Associate Professor at Instituto Superior Técnico, where he has taught Digital Systems, Object-Oriented Programming, Programming Languages, Knowledge Representation, and Natural Language Processing. He has been a proud member of the Spoken Language Systems Lab (L²F) since its creation. His current research interests are Natural Language Text Processing and Computer-Aided Language Learning. He is the PI of the REAP.PT project, and he is also involved in the OOBIAN project.

Jorge Baptista graduated in Languages and Literatures-Portuguese Studies and obtained an MA in Portuguese Linguistics from Faculdade de Letras da Univ. Lisboa with a thesis on nominal compounding and electronic dictionaries. His PhD thesis, on predicative nouns and the nominalization of adjectival constructions, was presented to the Univ. Algarve, where he started working in 1992. He has been an invited researcher at the Spoken Language Systems Lab (L²F) since 2005. His current research interests in NLP concern Information Extraction (Named Entity Recognition and Relation Extraction), Parsing, and Computer-Assisted Language Learning; in Linguistics proper, he is working on the lexicon-syntax description of verbal, adverbial, and idiomatic constructions. He is also involved in the REAP.PT and OOBIAN projects.

XEROX/XRCE Liaison and Collaborator

Caroline Hagège has been a research engineer since July 2001 in the Parsing and Semantics group of the XEROX Research Centre Europe (XRCE), Grenoble, France, working mainly on robust and deep parsing and on the bridge between syntax and semantics. She holds a PhD (Doctorat) in Computational Linguistics from the [www.univ-bpclermont.fr/ University Blaise Pascal], Clermont-Ferrand, France, carried out at the GRIL laboratory. Before joining XRCE, she was a researcher at the L2F laboratory (INESC-ID, Lisboa, Portugal), where she was involved in Portuguese robust language processing (morphology and shallow parsing). She was key to the initial development of the Portuguese grammar for XIP, and has since worked as liaison with XEROX/XRCE. She has collaborated intensively in the development of the Named Entity Recognition module of STRING, particularly in the time expressions grammar, having joined forces with us in the proposal and successful participation in the TIMEX evaluation track of the Second HAREM (2008), where STRING was assessed as the best NER-TIMEX system.

Team members

Joana Pinto (2015-present)

Improve the MARv4 module to perform a morphosyntactic selection taking into account not only the category and subcategory fields but also the mood, tense, person, and number of verb forms. [not yet]

José Correia (2014-present)

Develop a tool that shows the co-occurrence between words from Portuguese corpora. [not yet]

Generate new exercises for the REAP.PT platform on the transformation of active sentences into passive ones, and vice versa. [not yet]

Amanda Rassi (2014-present)

Describe the predicative nouns built with the support verb "ter” (to have) in Brazilian Portuguese, collected from PLN.Br corpus and other sources, and linguistically formalized into Lexicon-Grammar tables, representing their formal (structural and transformational) and semantic (distributional) properties. The data will be integrated in the STRING system in order to improve parsing, particularly dependency extraction and event representation. [1, 2]

Cristina Turati (2013-present)

Describe the predicative nouns built with the support verb "dar” (to give) in Brazilian Portuguese, collected from PLN.Br corpus and other sources, and linguistically formalized into Lexicon-Grammar tables, representing their formal (structural and transformational) and semantic (distributional) properties. The data will be integrated in the STRING system in order to improve parsing, particularly dependency extraction and event representation. [1]

Francisco Dias (2014-present)

Automatic generation of rules to identify frozen expressions. Development of different text anonymization methods applied to translation tasks, in order to reduce the need for manual redaction by humans. These texts should be de-identified before being exposed to and translated by humans, and must then be re-identified in the corresponding text in the target language. The anonymization task will take place in cooperation with Unbabel Lda. [not yet]

Ilia Markov (2013-2014)

Improved the extraction of semantic relations between textual elements as currently performed by STRING, by targeting whole-part relations (meronymy), that is, a semantic relation between an entity perceived as a constituent part of another entity, or as a member of a set. In this case, we focus on the type of meronymy involving human entities and body-part nouns (Nbp); e.g., O Pedro partiu uma perna 'Pedro broke a leg': WHOLE-PART(Pedro,perna). In order to extract this type of whole-part relation, a rule-based meronymy extraction module was built and integrated in the grammar of the STRING system. Around 17,000 Nbp instances were extracted from the first fragment of the CETEMPúblico corpus for the evaluation of this work. We also retrieved 79 instances of disease nouns (Nsick) derived from an underlying Nbp (e.g., gastrite-estômago 'gastritis-stomach'). In order to produce a gold standard for the evaluation, a random stratified sample of 1,000 sentences was selected, keeping the proportion of the total frequency of Nbp in the source corpus. This sample also includes a small number of Nsick (6 lemmas, 17 sentences). These instances were annotated by four native Portuguese speakers, and for 100 of them the inter-annotator agreement was calculated and ranged from "fair" to "good". [MSc Dissertation, 1, 2, 3, 4]
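The stratified sampling step described above, where each Nbp lemma keeps its share of the total corpus frequency, can be sketched roughly as follows (toy data and a hypothetical helper; the actual corpus tooling is not shown here):

```python
import random


def stratified_sample(sentences_by_lemma, total=1000, seed=0):
    """Draw a sample whose per-lemma counts mirror each lemma's
    share of the total Nbp frequency in the source corpus."""
    rng = random.Random(seed)
    grand_total = sum(len(s) for s in sentences_by_lemma.values())
    sample = []
    for lemma, sents in sentences_by_lemma.items():
        # number of sentences this lemma contributes, by proportion
        k = min(round(total * len(sents) / grand_total), len(sents))
        sample.extend(rng.sample(sents, k))
    return sample


# toy data: lemma -> sentences containing that body-part noun
corpus = {
    "perna": [f"perna-{i}" for i in range(600)],
    "olho": [f"olho-{i}" for i in range(300)],
    "mão": [f"mão-{i}" for i in range(100)],
}
sample = stratified_sample(corpus, total=100)
# per-lemma counts in the sample follow the 6:3:1 corpus proportions
```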

Pedro Curto (2013-2014)

Developed a system that automatically classifies text readability for European Portuguese, highlighting the key challenges in language feature selection and text classification. The system uses STRING to extract linguistic features from texts, which are then used by an automatic readability classifier. Currently, the system extracts 52 features grouped into 7 groups: parts-of-speech (POS), syllables, words, chunks and phrases, averages and frequencies, and some extra features. A classifier was created using these features and a corpus previously annotated by readability level, following the official five-level classification standard for Portuguese as a Second Language. In five-level (A1 to C1) and three-level (A, B, and C) scenarios, the best-performing learning algorithm (LogitBoost) yields accuracies of 79.25% and 86.32%, respectively. [MSc Dissertation, 1]

Gonçalo Suissas (2013-2014)

Developed a module that adequately chooses the precise sense a verb features in a given sentence, from among its other potential meanings (Verb Sense Disambiguation, VSD). Various supervised classification methods that can be adopted for VSD were tested in several scenarios to determine the impact of different features. The baseline accuracy of 63.86% results from taking the most frequent sense (MFS) for each verb lemma in a set of 24 verbs. Among the ML techniques tested, the best method was the Naive Bayes algorithm, which achieved an accuracy of 67.71%, a gain of 3.85% above the baseline. [MSc Dissertation]
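The most-frequent-sense (MFS) baseline mentioned above can be sketched as follows (toy annotated pairs, not the actual ViPEr sense inventory or evaluation data):

```python
from collections import Counter


def mfs_baseline_accuracy(train, test):
    """train/test: lists of (verb_lemma, sense) pairs.
    Predict each test verb's sense as its most frequent training
    sense and return the resulting accuracy."""
    by_lemma = {}
    for lemma, sense in train:
        by_lemma.setdefault(lemma, Counter())[sense] += 1
    mfs = {lemma: c.most_common(1)[0][0] for lemma, c in by_lemma.items()}
    hits = sum(1 for lemma, sense in test if mfs.get(lemma) == sense)
    return hits / len(test)


# toy data: "partir" is used as 'break' 70% of the time in training
train = [("partir", "break")] * 7 + [("partir", "leave")] * 3
test = [("partir", "break")] * 6 + [("partir", "leave")] * 4
# the baseline predicts 'break' everywhere -> accuracy 0.6
```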

Rita Policarpo (2013-present)

Use of machine learning algorithms for semantic classification of names, based on a manual classification of the 5000 most common names and a set of hierarchical labels already defined. [not yet]

Implement a repository of morphological entities and develop an interface to manipulate them without requiring knowledge about the structure of the repository. [not yet]

Rui Santos (2012-2014)

A general-purpose, consensual set of 37 semantic roles was defined, based on a survey of the relevant related work and using highly reproducible properties. A set of annotation guidelines was also built, in order to clarify how semantic roles should be assigned to verbal arguments in context. An SRL module was built and integrated in STRING. For this module, the information from a lexicon-syntactic database, ViPEr, which contains the relevant linguistic information for more than 6,000 European Portuguese full (or lexical, or distributional) verbs, was used, and the database was manually enriched with the information pertaining to the semantic roles of all verbal arguments. The SRL module is composed of 183 pattern-matching rules for labeling the subject (N0) and the first (N1) and second (N2) essential complements of verbal constructions; it also allows the attribution of semantic roles to other syntactic slots in the case of time, locative, manner, instrumental, comitative, and other complements (both essential and circumstantial). The module was tested on a small corpus specifically annotated for this purpose. After this manual annotation, the corpus, containing 655 semantic roles, was used as a gold standard for automatic comparison with the system's output. [MSc Dissertation, 1]
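A pattern-matching rule of the kind described, mapping syntactic slots (N0/N1/N2) to semantic roles via a per-verb lexicon entry, might look roughly like this (hypothetical rule format and lexicon entry; the real module uses XIP rule syntax and ViPEr data):

```python
def assign_roles(dependencies, verb_roles):
    """dependencies: (relation, verb_lemma, argument) triples from a parser.
    verb_roles: per-lemma mapping from syntactic slot (N0/N1/N2) to
    semantic role, as would be read off a ViPEr-style lexicon."""
    slot_of = {"SUBJ": "N0", "CDIR": "N1", "CINDIR": "N2"}
    labels = []
    for rel, verb, arg in dependencies:
        slot = slot_of.get(rel)
        role = verb_roles.get(verb, {}).get(slot)
        if role:
            labels.append((arg, role))
    return labels


# toy lexicon entry: "partir" (to break), N0 = Agent, N1 = Patient
lexicon = {"partir": {"N0": "Agent", "N1": "Patient"}}
deps = [("SUBJ", "partir", "Pedro"), ("CDIR", "partir", "perna")]
# assign_roles(deps, lexicon) -> [("Pedro", "Agent"), ("perna", "Patient")]
```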

João Marques (2012-2013)

A co-referential, pronominal anaphora resolution module for Portuguese was developed and incorporated into L2F's NLP chain, STRING. This work also aimed to improve on the module previously in use, developed by Nuno Nobre, whose evaluation produced an f-measure of 33.5%. To do so, we annotated a rather heterogeneous corpus composed of texts from different genres: novels, news pieces, magazine news, and newspaper columns, among others. In total, it contains 290,000 tokens, and the annotation campaign produced 9,268 anaphors. The strategy adopted was based on the identification of anaphors and candidates through a rule system, and on the selection of the most probable candidate antecedent by a model built with the Expectation-Maximization (EM) machine learning algorithm. The system's evaluation showed a significant improvement in ARM 2.0's performance, with an f-measure of 82% on anaphor identification, 70% on antecedent candidate identification, and 54% on anaphora resolution. [MSc Dissertation]

Viviana Cabrita (2012-2014)

Introduced a new module to perform event ordering. Only events in the same sentence are considered, and two types of event ordering (before and simultaneously) are computed. Any missing or unknown order is treated as an unknown type, lacking a visual representation. A visual representation of the solution was also implemented, making it easier to interpret. [MSc Dissertation]

Alexandre Vicente (2011-2013)

This work merged the tokenization and morphological analysis modules into a single module, LexMan, using transducers. With this change, it was possible to transfer morpho-syntactic, context-independent joining rules (for compound identification), previously implemented in the chain's morphosyntactic disambiguator, RuDriCo, to the LexMan module. The information used in the generation of the dictionary transducer can now also be complemented by derivational information, making it possible to recognize prefix-derived words, particularly neologisms. [MSc Dissertation]

Tiago Travanca (2011-2013)

This work addresses the problem of Verb Sense Disambiguation (VSD) in European Portuguese. VSD is a sub-problem of Word Sense Disambiguation (WSD), which tries to identify the sense in which a polysemous word is used in a given sentence. Thus, a sense inventory for each word (or lemma) must be used. For the VSD problem, this sense inventory consisted of a lexicon-syntactic classification of the most frequent verbs in European Portuguese (ViPEr). Two approaches to VSD were considered. The first, rule-based approach makes use of the lexical, syntactic, and semantic descriptions of the verb senses present in ViPEr to determine the meaning of a verb. The second approach uses machine learning, with a set of features commonly used in the WSD problem, to determine the correct meaning of the target verb. Both approaches were tested in several scenarios to determine the impact of different features and different combinations of methods. The baseline accuracy of 84%, resulting from the most frequent sense for each verb lemma, was both surprisingly high and hard to surpass. Still, both approaches provided some improvement over this value. The best combination of the two techniques and the baseline yielded an accuracy of 87.2%, a gain of 3.2% above the baseline. [MSc Dissertation]

Filipe Carapinha (2011-2013)

Development of a slot-filling (SF) module, to be integrated in the STRING system. The slot-filling task is an Information Extraction challenge that consists in aggregating all the information associated with a given named entity (NE) in a predefined template of relations and attributes. For now, this module will deal with the PERSON and ORGANIZATION NE types, already implemented in STRING (Oliveira 2010). An already XIP-implemented Relation Extraction module (Santos 2010) will be used to map those relations onto the corresponding slots. Since both these modules rely on accurate Anaphora Resolution (AR), the already existing AR module (Nobre 2011) will be integrated in STRING in order to improve the quantity and quality of the extracted information. [MSc Dissertation]

Cláudio Diniz (2009-present)

Implementation of RuDriCo2, a morphological rule-based disambiguator that can change the segmentation of the text, improving the former version of this module and its rules' syntax, and optimizing the system's main algorithm. Implementation of LexMan, a lexical analyzer based on finite-state transducers, which associates with the text tokens all the relevant morpho-syntactic information for their further processing. It uses a rich and highly granular tag set, adapted from the PAROLE project, featuring 12 part-of-speech categories and 11 fields. LexMan replaced a previous module of the NLP chain, Palavroso. LexMan is used to generate and validate all the inflected forms associated with lexical lemmas, along with the corresponding morpho-syntactic information. To this end, the conversion of previous lexical resources was necessary. LexMan has much improved the performance of the STRING chain, and it also provides an efficient, fast, and flexible way of maintaining and updating the lexicons. Implementation of the new MARv4, a statistical part-of-speech tagger whose function is to choose the most likely POS tag for each word, using HMMs. The language model used by MARv4 is trained on a 250K-word Portuguese corpus originally produced under the PAROLE project. To train MARv4 and fine-tune the rule-based post-tagger disambiguation module of STRING, the training corpus underwent an extensive and systematic revision. To this end, scripts were produced to ensure consistency, fast access, and corpus maintenance. Development of the demo web interface for STRING. [MSc Dissertation, 2, 3, 4, 5]
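The HMM decoding that a tagger like MARv performs can be illustrated with a minimal Viterbi sketch (toy tags and probabilities, not the PAROLE-trained model or its tag set):

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence under a first-order HMM.
    Probabilities are plain dicts; missing entries fall back to a
    small smoothing value."""
    def lp(d, k, default=1e-6):
        return math.log(d.get(k, default))

    # best (log-score, path) ending in each tag, per position
    V = [{t: (lp(start_p, t) + lp(emit_p[t], words[0]), [t]) for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            prev = max(tags, key=lambda p: V[-1][p][0] + lp(trans_p[p], t))
            score = V[-1][prev][0] + lp(trans_p[prev], t) + lp(emit_p[t], w)
            row[t] = (score, V[-1][prev][1] + [t])
        V.append(row)
    return max(V[-1].values(), key=lambda x: x[0])[1]


tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {"DET": {"NOUN": 0.9}, "NOUN": {"VERB": 0.8}, "VERB": {"DET": 0.5}}
emit = {"DET": {"o": 0.9},
        "NOUN": {"pedro": 0.5, "perna": 0.5},
        "VERB": {"partiu": 0.9}}
# viterbi(["o", "pedro", "partiu"], ...) -> ["DET", "NOUN", "VERB"]
```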

Vera Cabarrão (2010-present)

Currently doing her MA thesis at Universidade de Lisboa - Faculdade de Letras. She has been in the team since 2010. Corpus annotation for Named Entity Recognition (NER), Relation Extraction, Anaphora Resolution, and Time Expressions. Writing rules in XIP for the identification of Natural Events ("tsunamis", "earthquakes") and Organized Events (political, scientific, artistic, and others) as NE. Writing dependency rules in XIP for Relation Extraction, namely Lifetime (e.g., relations regarding the events Birth, Death, Education), Business (e.g., Job, Foundation, Owner), and Location relations. Analysis of Portuguese newspapers to test the correct identification and annotation of NE by XIP. Improvement of the XIP lexicons. Contribution to improving the "Classification directives for named entities in Portuguese texts" and the "Classification directives for Relations between NE". [1]

Munshi Asadullah (2010-2012)

This work proposed a heuristic-based modeling of data from two different parsers: PALAVRAS, a Constraint Grammar (CG) based parser, and the Phrase Structure Grammar (PSG) based Finite-State Parser (FSP) used as the parsing backbone of the STRING Natural Language Processing (NLP) chain for Portuguese. Different models using the two parsers' outputs were produced and put together in a linear combination for performance maximization. For the development of the research, a processing framework was also proposed and its development is presented. A dependency annotation tool was also developed within the scope of the research. The models' performance was satisfactory, if not extraordinary, although the primary objective was to present the modeling possibilities rather than absolute performance. [MSc Dissertation]

A module for TIMEX (time expressions) processing. TIMEX processing is part of the Named Entity Recognition (NER) task. This new TIMEX module aims to identify, classify, and normalize temporal expressions contained in Portuguese written text. The TIMEX classification guidelines adopted for the participation in the Second HAREM joint evaluation campaign were extended and adapted to identify more complex types of TIMEX. The TIMEX processing module was developed, evaluated, and integrated in STRING. [MSc Dissertation]

Diogo Oliveira (2009-2011)

Improvement of the Named Entity Recognition (NER) module of STRING, especially for the HUMAN, LOCATION, and AMOUNT categories, with reference to the performance attained during the Second HAREM joint evaluation campaign (2008). A new set of delimitation and classification directives was proposed to replace those used in the Second HAREM. Several improvements were introduced in the NLP chain, especially in the XIP syntactic parser, which is responsible for named entity extraction. Finally, the system performance was evaluated, and a general trend of improvement was confirmed. [MSc Dissertation]

Ricardo Portela (2008-2011)

Identification of multiword expressions (MWE) in Portuguese. MWE are sequences of words whose meaning cannot be calculated from the composition of the literal meanings of their individual words, so that together they acquire a figurative/idiomatic/non-compositional meaning. Several collocation-based statistical methods were used to improve MWE extraction. STRING was used to test linguistically motivated criteria against the syntactic dependencies extracted from the entire CETEMPúblico corpus and the semantic features associated with its lexicon. Procedures for processing large-scale corpora using the L2F GRID parallel computing infrastructure, as well as scheduling and parallel programming software, were implemented, and the results were evaluated from different perspectives. [MSc Dissertation, 2]
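One collocation-based statistical measure commonly used for this kind of MWE extraction is pointwise mutual information (PMI); a minimal sketch on toy counts (the counts below are invented for illustration, not taken from CETEMPúblico):

```python
import math


def pmi(pair_count, w1_count, w2_count, n_pairs):
    """Pointwise mutual information of a word pair:
    log2( P(w1,w2) / (P(w1) * P(w2)) ), from raw corpus counts."""
    p_pair = pair_count / n_pairs
    p1 = w1_count / n_pairs
    p2 = w2_count / n_pairs
    return math.log2(p_pair / (p1 * p2))


# toy counts: suppose "dar conta" ('to realize') occurs 50 times out of
# 100,000 bigrams, with "dar" occurring 500 times and "conta" 400 times
score = pmi(50, 500, 400, 100_000)
# a high positive PMI suggests the pair co-occurs far more often than
# chance, i.e. it is a candidate collocation/MWE
```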

Daniel Santos (2009-2010)

Implementation of an Information Extraction rule-based module for Relation Extraction (RE), specifically built to maximize the information retrieved. A set of directives for relation identification and annotation was defined, inspired by the work already developed for Portuguese and English. At this stage, FAMILY relations (spouse, parent, sibling, etc.), LIFETIME relations (date of birth or death), BUSINESS relations (employee, client, owner, etc.), and LOCATION relations (for people and organizations) are extracted. An evaluation corpus was selected and annotated by a linguist, in order to perform a more independent evaluation, thus allowing a better analysis of the results. [MSc Dissertation]

Nuno Nobre (2009-2011)

Implementation of an Anaphora Resolution (AR) post-processing module that operates on the output of the XIP parser. This module deals with pronominal anaphora, that is, the coreference (or identity) relation between pronouns (the anaphor) and a previous mention of the same entity in discourse (the antecedent). At this stage, third-person personal and possessive pronouns, as well as relative and demonstrative pronouns (in headless NPs), are resolved. During system development, a manual annotation tool was created, allowing text to be enriched with anaphoric information quickly. The system was evaluated on a corpus that had been manually annotated by a linguist using that annotation tool, and it presented an f-measure of 33.5%. [MSc Dissertation]

Fernando Gomes (2008-2009)

Validation over a corpus of lexical-syntactic matrices, i.e., formal descriptions of the linguistic properties associated with lexical items, is a difficult and time-consuming task, but essential if such information is to be used in NLP tasks. The validation is based on a statistical comparison between results obtained from a large corpus using STRING and the information contained in the matrices. This information consists of morphological, distributional, and transformational properties of lexical items, each of which must be verified individually through distinct processes. The statistical comparison is done with the aid of GRID computing, as well as scheduling and parallel programming software. Finally, an evaluation of the work was performed to check the findings. [MSc Dissertation]

Improvement of the performance of MARv, a statistical part-of-speech tagger whose function is to choose the most likely POS tag for each word using the Viterbi algorithm: namely, its processing time and memory management, and the reduction of its error rate when disambiguating. The new implementation, MARv2, increased precision by 23.72% and is significantly (9 times) faster than the previous version. Furthermore, it does not discard rejected tags and uses the same DTD as RuDriCo2. [MSc Dissertation]

João Loureiro (2006-2007)

A Named Entity Recognition (NER) module for the categories Work-of-Art (Obra), Value (Valor), Family Relations (Relações de Parentesco), and Time (Tempo), for Portuguese. A first attempt to normalize time expressions such as dates ("24 de Novembro de 2005") and other productive phrases ("no próximo dia" 'on the next day'). Time normalization consists in converting time expressions' values to a standard format, allowing this information to be shared between different systems. [MSc Dissertation, 2]
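Normalizing a Portuguese date such as "24 de Novembro de 2005" to a standard (ISO 8601) value might be sketched as follows (a single hypothetical pattern; the actual module's rules cover many more expression types):

```python
import re

# Portuguese month names -> month numbers
MONTHS = {
    "janeiro": 1, "fevereiro": 2, "março": 3, "abril": 4,
    "maio": 5, "junho": 6, "julho": 7, "agosto": 8,
    "setembro": 9, "outubro": 10, "novembro": 11, "dezembro": 12,
}

# matches dates of the form "DD de <month> de YYYY"
DATE_RE = re.compile(r"(\d{1,2}) de (\w+) de (\d{4})")


def normalize_date(text):
    """Convert 'DD de <month> de YYYY' into an ISO 8601 date string,
    or return None when no such date is found."""
    m = DATE_RE.search(text)
    if not m:
        return None
    day, month_name, year = m.groups()
    month = MONTHS.get(month_name.lower())
    if month is None:
        return None
    return f"{int(year):04d}-{month:02d}-{int(day):02d}"


normalize_date("24 de Novembro de 2005")  # -> "2005-11-24"
```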

Luís Romão (2006-2007)

Development of the Named Entity Recognition (NER) module of STRING, focusing on the categories LOCATION, ORGANIZATION, PEOPLE, and EVENT. In this rule-based approach to the NLP task, NE are identified based solely on the information in the lexicons and on manually built rules, either contextual or based on the entity's structure. The system was evaluated according to the criteria defined by the First HAREM, a NER joint evaluation campaign for Portuguese. Results were in general above average when compared to other participant systems, obtaining the best results in the identification of ORGANIZATIONS and the best global results in several of the classification evaluation scenarios. [MSc Dissertation, 2]

Telmo Machado (2006-2007)

Development of an information extraction system in a specific domain, the cooking domain. The system is composed of three modules: pre-processing, recipe processing and output transformation. The main objectives are to identify ingredients and their associated quantities, and to identify the different tasks needed to prepare each recipe, as well as the utensils needed and the ingredients used in each of these tasks. After identifying the ingredients and the tasks of each recipe, these are introduced into a database. This database is supported by an ontology, which contains a description of the concepts used in the culinary domain. [MSc Dissertation]

Development of the RuDriCo system, an evolution of PAsMo (post-morphological analyzer) [Faiza 99]. RuDriCo adapts the output of the morphological analyzer to the specific needs of each parser. The modifications the system produces include: (i) segmentation changes; (ii) changes to the information added to the words tagged by the morphological analyzer; (iii) changes in the output format of the morphological analyzer, so that it is adequate for the format required by the parser. All these modifications are expressed by way of declarative transformation rules, based on the concept of pattern matching. [Graduation Thesis]

Contributed to the implementation of the first version of the chain. Developed the MARv module, still used by STRING. Provided specific support for the development of the subsequent MARv versions and general support throughout the development of the chain. Was part of the team that developed the corpus used for training and testing MARv in the LE-PAROLE project. [MSc Dissertation, 2]