Article Structure

Abstract

In this paper, we investigate various strategies to predict both syntactic dependency parsing and contiguous multiword expression (MWE) recognition, testing them on the dependency version of French Treebank (Abeille and Barrier, 2004), as instantiated in the SPMRL Shared Task (Seddah et al., 2013).

Introduction

A real-life parsing system should include the recognition of multiword expressions (MWEs), first because downstream semantic-oriented applications need some marking in order to distinguish between regular semantic composition and the typical semantic non-compositionality of MWEs.

Related work

In the introduction, we gave references to previous work on predicting MWEs and constituency parsing.

Data: MWEs in Dependency Trees

The data we use is the SPMRL 13 dataset for French, in dependency format.

Architectures for MWE Analysis and Parsing

The architectures we investigated vary depending on whether the MWE status of sequences of tokens is predicted via dependency parsing or via an external tool (described in section 5), and this dichotomy applies both to structured MWEs and flat MWEs.

Use of external MWE resources

In order to better deal with MWE prediction, we use external MWE resources, namely MWE lexicons and an MWE analyzer.

Experiments

6.1 Settings and evaluation metrics

Conclusion

We experimented with strategies to predict both MWE analysis and dependency structure, and tested them on the dependency version of the French Treebank (Abeille and Barrier, 2004), as instantiated in the SPMRL Shared Task (Seddah et al., 2013).

Topics

dependency parsing

In Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing

In this paper, we investigate various strategies to predict both syntactic dependency parsing and contiguous multiword expression (MWE) recognition, testing them on the dependency version of French Treebank (Abeille and Barrier, 2004), as instantiated in the SPMRL Shared Task (Seddah et al., 2013).

Page 1, “Abstract”

It is a less language-specific system that reranks n-best dependency parses from 3 parsers, informed with features from predicted constituency trees.

In some experiments, we make use of alternative representations, which we refer to later as “labeled representation”, in which the MWE features are incorporated in the dependency labels, so that MWE composition and/or the POS of the MWE are entirely contained in the tree topology and labels, and are thus predictable via dependency parsing.

Page 4, “Data: MWEs in Dependency Trees”

The architectures we investigated vary depending on whether the MWE status of sequences of tokens is predicted via dependency parsing or via an external tool (described in section 5), and this dichotomy applies both to structured MWEs and flat MWEs.

Page 4, “Architectures for MWE Analysis and Parsing”

IRREG-BY-PARSER: the MWE status, flat topology and POS are all predicted via dependency parsing, using representations for training and parsing, with all information for irregular MWEs encoded in topology and labels (as for in vain in Figure 2).

Shared Task

Appears in 11 sentences as: Shared Task (11)

In Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing

In this paper, we investigate various strategies to predict both syntactic dependency parsing and contiguous multiword expression (MWE) recognition, testing them on the dependency version of French Treebank (Abeille and Barrier, 2004), as instantiated in the SPMRL Shared Task (Seddah et al., 2013).

Page 1, “Abstract”

While the realistic scenario of syntactic parsing with automatic MWE recognition (either done jointly or in a pipeline) has already been investigated in constituency parsing (Green et al., 2011; Constant et al., 2012; Green et al., 2013), the French dataset of the SPMRL 2013 Shared Task (Seddah et al., 2013) only recently provided the opportunity to evaluate this scenario within the framework of dependency syntax. In such a scenario, a system predicts dependency trees with marked groupings of tokens into MWEs.

Page 1, “Introduction”

In this paper, we investigate various strategies for predicting from a tokenized sentence both MWEs and syntactic dependencies, using the French dataset of the SPMRL 13 Shared Task.

Page 1, “Introduction”

Footnote 2: The main focus of the Shared Task was on predicting both morphological and syntactic analysis for morphologically-rich languages.

Page 1, “Introduction”

To our knowledge, the first works on predicting both MWEs and dependency trees are those presented at the SPMRL 2013 Shared Task that provided scores for French (which is the only dataset containing MWEs).

Page 2, “Related work”

(2013) proposed to combine pipeline and joint systems in a reparser (Sagae and Lavie, 2006), and ranked first at the Shared Task.

Page 2, “Related work”

It uses neither features nor treatment specific to MWEs, as it focuses on the general aim of the Shared Task, namely coping with prediction of morphological and syntactic analysis.

Page 2, “Related work”

The Shared Task used an enhanced version of the constituency-to-dependency conversion of Candito et al.

Page 2, “Data: MWEs in Dependency Trees”

We compare these four architectures with one another and also with two simpler architectures used by Constant et al. (2013) within the SPMRL 13 Shared Task, in which regular and irregular MWEs are not distinguished:

Page 5, “Architectures for MWE Analysis and Parsing”

Moreover, we provide in table 5 a comparison of our best architecture with regular/irregular MWE distinction with other architectures that do not make this distinction, namely the two best comparable systems designed for the SPMRL Shared Task (Seddah et al., 2013): the pipeline simple parser based on Mate-tools of Constant et al.

Page 8, “Experiments”

We experimented with strategies to predict both MWE analysis and dependency structure, and tested them on the dependency version of the French Treebank (Abeille and Barrier, 2004), as instantiated in the SPMRL Shared Task (Seddah et al., 2013).

In the “labeled representation” evaluation, the UAS provides a measure of syntactic attachments for sequences of words, independently of the (regular) MWE status of subsequences.

Page 7, “Experiments”

The UAS for labeled representation will be maximal, whereas for the flat representation, the last two tokens will count as incorrect for UAS.

Page 7, “Experiments”
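As a rough illustration of the two attachment scores discussed in these excerpts, the following sketch computes UAS (correct head) and LAS (correct head and label) for one sentence. The function name, the `(head, label)` encoding, the label names other than dep_cpd, and the 4-token example are assumptions made for illustration; the paper's own scores come from its evaluation script, not from this code.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions over all tokens of one sentence."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted parses must cover the same tokens")
    if not gold:
        raise ValueError("empty sentence")
    # UAS: the predicted head matches the gold head.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the label match.
    both_ok = sum(g == p for g, p in zip(gold, pred))
    return head_ok / len(gold), both_ok / len(gold)


# Hypothetical example: the last two tokens are attached to the wrong head,
# so they count as incorrect for both scores.
gold = [(2, "det"), (0, "root"), (2, "mod"), (3, "dep_cpd")]
pred = [(2, "det"), (0, "root"), (4, "mod"), (2, "dep_cpd")]
uas, las = attachment_scores(gold, pred)
```

Because LAS requires the label to match as well, a representation that encodes MWE information in labels can leave UAS untouched while lowering LAS when the MWE status is wrong.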

the “labeled evaluation”, we obtain a LAS evaluation for the whole task of parsing plus MWE recognition, but a UAS evaluation that penalizes errors on MWE status less, while keeping a representation that is richer: predicted parses contain not only the syntactic dependencies and MWE information, but also a classification of MWEs into regular and irregular, and the internal syntactic structure of regular MWEs.

Page 7, “Experiments”

The evaluation on “structured representation” can be interpreted as an evaluation of the parsing task plus the recognition of irregular MWEs only: both LAS and UAS are measured independently of errors on regular MWE status (note that the UAS is exactly the same as in the “labeled” case).

Page 7, “Experiments”

Concerning the three distinct representations, evaluating on structured representation (hence without looking at regular MWE status) leads to an increase of roughly 2 points for the LAS and 1 point for the UAS, with respect to the evaluation against flat representation.

Page 8, “Experiments”

The evaluation on the labeled representation provides an evaluation of the full task (parsing, regular/irregular MWE recognition and regular MWEs structuring), with a UAS that is less impacted by errors on regular MWE status, while LAS reflects the full difficulty of the task.

Page 8, “Experiments”

Footnote 16: The slight differences in LAS between the labeled and the flat representations are due to side effects of errors on MWE status: some wrong reattachments performed to obtain flat representation decrease the UAS, but also in some cases the LAS.

Treebank

Appears in 8 sentences as: Treebank (5) treebank (3)

In Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing

In this paper, we investigate various strategies to predict both syntactic dependency parsing and contiguous multiword expression (MWE) recognition, testing them on the dependency version of French Treebank (Abeille and Barrier, 2004), as instantiated in the SPMRL Shared Task (Seddah et al., 2013).

Page 1, “Abstract”

The French dataset is the only one containing MWEs: the French treebank is notable for containing a high ratio of tokens belonging to an MWE (12.7% of non-numerical tokens).

Page 1, “Introduction”

Our representation also resembles that of light-verb constructions (LVC) in the Hungarian dependency treebank (Vincze et al., 2010): the construction has regular syntax, and a suffix is used on labels to indicate that it is an LVC (Vincze et al., 2013).

Page 2, “Related work”

It contains projective dependency trees that were automatically derived from the latest status of the French Treebank (Abeille and Barrier, 2004), which consists of constituency trees for sentences from the

Page 2, “Data: MWEs in Dependency Trees”

For instance, in the French Treebank, population active (lit.

Page 3, “Data: MWEs in Dependency Trees”

In order to compare the MWEs present in the lexicons and those encoded in the French treebank, we applied the following procedure (hereafter called lexicon

Page 5, “Use of external MWE resources”

We had to convert the DELA POS tagset to that of the French Treebank.

Page 5, “Use of external MWE resources”

We experimented with strategies to predict both MWE analysis and dependency structure, and tested them on the dependency version of the French Treebank (Abeille and Barrier, 2004), as instantiated in the SPMRL Shared Task (Seddah et al., 2013).

dependency trees

Appears in 7 sentences as: dependency tree (1) dependency trees (6)

In Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing

While the realistic scenario of syntactic parsing with automatic MWE recognition (either done jointly or in a pipeline) has already been investigated in constituency parsing (Green et al., 2011; Constant et al., 2012; Green et al., 2013), the French dataset of the SPMRL 2013 Shared Task (Seddah et al., 2013) only recently provided the opportunity to evaluate this scenario within the framework of dependency syntax. In such a scenario, a system predicts dependency trees with marked groupings of tokens into MWEs.

Page 1, “Introduction”

The trees show syntactic dependencies between semantically sound units (made of one or several tokens), and are thus particularly appealing for downstream semantic-oriented applications, as dependency trees are considered to be closer to predicate-argument structures.

Page 1, “Introduction”

To our knowledge, the first works on predicting both MWEs and dependency trees are those presented at the SPMRL 2013 Shared Task that provided scores for French (which is the only dataset containing MWEs).

Page 2, “Related work”

It contains projective dependency trees that were automatically derived from the latest status of the French Treebank (Abeille and Barrier, 2004), which consists of constituency trees for sentences from the

Page 2, “Data: MWEs in Dependency Trees”

Figure 1: French dependency tree for L’abus de biens sociaux fut dénoncé en vain (literally the misuse of assets social was denounced in vain, meaning The misuse of corporate assets was denounced in vain), containing two MWEs (in red).

Page 2, “Data: MWEs in Dependency Trees”

In the dependency trees, there is no “node” for a MWE as a whole, but one node per MWE component (more generally one node per token).

Page 2, “Data: MWEs in Dependency Trees”

Our first motivation is to increase the quantity of information conveyed by the dependency trees, by distinguishing syntactic regularity and semantic regularity.

statistically significant

In Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing

To evaluate statistical significance of parsing performance differences, we use eval07.pl with the -b option, and then Dan Bikel’s comparator. For MWEs, we use the F-measure for recognition of untagged MWEs (hereafter FUM) and for recognition of tagged MWEs (hereafter FTM).

Page 6, “Experiments”
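The two MWE metrics named above can be sketched as follows, assuming each MWE is represented as a `(start, end, POS)` triple over token positions: FUM compares spans only, while FTM also requires the MWE part-of-speech to match. The triple encoding, the function names and the example tags are assumptions made for illustration, not the paper's evaluation code.

```python
from typing import Set, Tuple

MWE = Tuple[int, int, str]  # (first token, last token, MWE part-of-speech)


def f_measure(gold: Set, pred: Set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def fum_ftm(gold: Set[MWE], pred: Set[MWE]) -> Tuple[float, float]:
    """Return (FUM, FTM): F-measure on untagged spans and on tagged MWEs."""
    untagged_gold = {(start, end) for start, end, _ in gold}
    untagged_pred = {(start, end) for start, end, _ in pred}
    return f_measure(untagged_gold, untagged_pred), f_measure(gold, pred)


# Hypothetical example: both spans are found, but one receives the wrong POS.
gold = {(0, 1, "N"), (4, 5, "ADV")}
pred = {(0, 1, "N"), (4, 5, "P")}
fum, ftm = fum_ftm(gold, pred)
```

In this example the segmentation is perfect, so only the tagged score is penalized.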

For each architecture except the PIPELINE one, differences between the baseline and the best setting are statistically significant (p < 0.01).

Page 7, “Experiments”

Best JOINT has statistically significant difference (p < 0.01) over both best JOINT-REG and best PIPELINE.

Page 7, “Experiments”

We computed statistical significance of differences between our systems and Const13.

POS tagging

In Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing

Footnote 6: We use the version available in the POS tagger MElt (Denis and Sagot, 2009).

Page 5, “Use of external MWE resources”

The MWE analyzer is a CRF-based sequential labeler, which, given a tokenized text, jointly performs MWE segmentation and POS tagging (of simple tokens and of MWEs), both tasks mutually helping each other.

Page 6, “Use of external MWE resources”

The MWE analyzer integrates, among others, features computed from the external lexicons described in section 5.1, which greatly improve POS tagging (Denis and Sagot, 2009) and MWE segmentation (Constant and Tellier, 2012).

Page 6, “Use of external MWE resources”

Footnote 9: Note that in our experiments, we use this analyzer for MWE analysis only, and discard the POS tagging prediction.

part-of-speech

Appears in 3 sentences as: part-of-speech (3)

In Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing

In gold data, the MWEs appear in an expanded flat format: each MWE bears a part-of-speech and consists of a sequence of tokens (hereafter the “components” of the MWE), each having its own POS, lemma and morphological features.

Page 2, “Data: MWEs in Dependency Trees”

For flat MWEs, the only missing information is the MWE part-of-speech: we concatenate it to the dep_cpd labels.

Page 4, “Data: MWEs in Dependency Trees”
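The concatenation described in the excerpt above might look like the following sketch. It assumes that the MWE part-of-speech is stored on the MWE head token, that components attach to that head with the dep_cpd label, and that an underscore is used as separator; the `Token` class, the `to_labeled` function, the separator, the aux and mod labels, and the ADV tag for en vain are all hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str
    head: int                      # 1-based index of the governor, 0 for the root
    label: str
    mwehead: Optional[str] = None  # POS of the MWE, set on the MWE head token only


def to_labeled(tokens: List[Token], sep: str = "_") -> List[str]:
    """Return one label per token, appending the MWE POS to dep_cpd labels."""
    labels = []
    for tok in tokens:
        label = tok.label
        if label == "dep_cpd" and tok.head > 0:
            # Look up the MWE POS on the governor of this component.
            mwe_pos = tokens[tok.head - 1].mwehead
            if mwe_pos is not None:
                label = f"{label}{sep}{mwe_pos}"
        labels.append(label)
    return labels


# Hypothetical fragment: "en vain" treated as a flat MWE headed by "en".
sentence = [
    Token("fut", 2, "aux"),
    Token("dénoncé", 0, "root"),
    Token("en", 2, "mod", mwehead="ADV"),
    Token("vain", 3, "dep_cpd"),
]
labeled = to_labeled(sentence)
```

With the POS folded into the label, a parser trained on such labels predicts the MWE part-of-speech as a side effect of labeling the arc.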

We tested incorporating the MWE-specific features as defined in the gold flat representation (section 3.1): the mwehead=POS feature for the MWE head token, POS being the part-of-speech of the MWE; the component=y feature for the non-first MWE component.

regular expressions

In Strategies for Contiguous Multiword Expression Analysis and Dependency Parsing

MWEs are first classified as regular or irregular, using regular expressions over the sequence of parts-of-speech within the MWE.

Page 4, “Data: MWEs in Dependency Trees”

To define the regular expressions, we grouped gold MWEs according to the pair [global POS of the MWE + sequence of POS of the MWE components], and designed regular expressions to match the most frequent patterns that looked regular according to our linguistic knowledge.

Page 4, “Data: MWEs in Dependency Trees”

Footnote 5: The six regular expressions that we obtained cover nominal, prepositional, adverbial and verbal compounds.
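A classification of this kind can be sketched as below: the POS tags of the MWE components are joined into a string, and a pattern selected by the global POS of the MWE is matched against the whole string. The two patterns and the tag names (N, A, P, D) are invented placeholders and are not the six expressions used in the paper; only the overall mechanism follows the description above.

```python
import re
from typing import Sequence

# Hypothetical patterns, keyed by the global POS of the MWE. Each pattern is
# matched against the space-separated POS sequence of the MWE components.
REGULAR_PATTERNS = {
    "N": re.compile(r"N( A)+|N P( D)? N"),  # e.g. noun + adjective(s), noun + prep + noun
    "P": re.compile(r"P( D)? N P"),         # e.g. prep + (det) + noun + prep
}


def is_regular(mwe_pos: str, component_pos: Sequence[str]) -> bool:
    """Return True if the MWE matches a 'regular' pattern for its global POS."""
    pattern = REGULAR_PATTERNS.get(mwe_pos)
    if pattern is None:
        # No pattern for this global POS: treat the MWE as irregular.
        return False
    return pattern.fullmatch(" ".join(component_pos)) is not None


# A noun-adjective nominal compound is classified as regular under these
# placeholder patterns; an adverbial MWE with no pattern is irregular.
noun_adj = is_regular("N", ["N", "A"])
adverbial = is_regular("ADV", ["P", "A"])
```

Using `fullmatch` rather than `search` ensures that the whole component sequence, not just a fragment of it, fits the pattern.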