We propose Bilingually-constrained Recursive Auto-encoders (BRAE) to learn semantic phrase embeddings (compact vector representations for phrases), which can distinguish phrases with different semantic meanings.

Introduction

Models that use word embeddings as direct inputs to a DNN cannot make full use of the syntactic and semantic information in phrasal translation rules.

We isolate three ways in which word embeddings might augment a state-of-the-art statistical parser: by connecting out-of-vocabulary words to known ones, by encouraging common behavior among related in-vocabulary words, and by directly providing features for the lexicon.

Abstract

Despite small gains on extremely small supervised training sets, we find that extra information from embeddings appears to make little or no difference to a parser with adequate training data.

Introduction

This paper investigates a variety of ways in which word embeddings might augment a constituency parser with a discrete state space.

More importantly, we show empirically that word embeddings and word clusters capture different information and their combination would further improve the adaptability of relation extractors.

Related Work

Although word embeddings have been successfully employed in many NLP tasks (Collobert and Weston, 2008; Turian et al., 2010; Maas and Ng, 2010), the application of word embeddings in RE is very recent.

Specifically, our training encourages word embeddings to be consistent across alignment directions by introducing into the objective function a penalty term that expresses the difference between the word embeddings of the two directional models.

RNN-based Alignment Model

In the lookup layer, each of these words is converted to its word embedding, and then the concatenation of the two embeddings is fed to the hidden layer in the same manner as in the FFNN-based model.
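To make this lookup-and-concatenate step concrete, here is a minimal NumPy sketch; the embedding table, layer sizes, word indices, and the tanh non-linearity are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative sizes; these numbers are assumptions, not taken from the paper.
vocab_size, embed_dim, hidden_dim = 10000, 50, 100

rng = np.random.default_rng(0)
L = rng.normal(scale=0.1, size=(vocab_size, embed_dim))        # lookup table: one row per word
W_h = rng.normal(scale=0.1, size=(hidden_dim, 2 * embed_dim))  # hidden-layer weights
b_h = np.zeros(hidden_dim)

def hidden_activation(src_word_id, tgt_word_id):
    """Look up the two word embeddings, concatenate them, and feed the
    concatenation to the hidden layer (tanh non-linearity assumed)."""
    x = np.concatenate([L[src_word_id], L[tgt_word_id]])  # vector of length 2 * embed_dim
    return np.tanh(W_h @ x + b_h)

h = hidden_activation(42, 1337)  # hypothetical word indices
```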

Related Work

Word embeddings are dense, low dimensional, and real-valued vectors that can capture syntactic and semantic properties of the words (Bengio et al., 2003).

Training

Concretely, the constraint enforces agreement between the word embeddings of the two directions.

Training

The proposed method trains the two directional models concurrently using the following objective, which incorporates a penalty term expressing the difference between their word embeddings:

Training

where θ_{FE} (or θ_{EF}) denotes the weights of the layers in the source-to-target (or target-to-source) alignment model, θ_L denotes the weights of the lookup layer, i.e., the word embeddings, and α is a parameter that controls the strength of the agreement constraint.
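The equation referred to above is not reproduced in this excerpt; a plausible form, using the symbols restored in the preceding sentence (the exact decomposition of the loss terms is our assumption, not the authors' formulation), is

$$\min_{\theta_{FE},\,\theta_{EF}} \; \ell_{FE}(\theta_{FE}) + \ell_{EF}(\theta_{EF}) + \alpha \left\lVert \theta_{L}^{FE} - \theta_{L}^{EF} \right\rVert^{2},$$

where $\ell_{FE}$ and $\ell_{EF}$ are the training losses of the source-to-target and target-to-source models, and $\theta_{L}^{FE}$, $\theta_{L}^{EF}$ are their respective lookup-layer (embedding) weights.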

We present a novel technique for semantic frame identification using distributed representations of predicates and their syntactic context; this technique leverages automatic syntactic parses and a generic set of word embeddings.

Experiments

The second baseline tries to decouple the WSABIE training from the embedding input, and trains a log-linear model using the embeddings.

Experiments

Hyperparameters: For our frame identification model with embeddings, we search for the WSABIE hyperparameters using the development data.

Frame Identification with Embeddings

First, we extract the words in the syntactic context of “runs”; next, we concatenate their word embeddings as described in §2.2 to create an initial vector space representation.

Frame Identification with Embeddings

Formally, let x represent the actual sentence with a marked predicate, along with the associated syntactic parse tree; let our initial representation of the predicate context be g(x). Suppose that the word embeddings we start with are of dimension n. Then g is a function from a parsed sentence x to R^{nk}, where k is the number of possible syntactic context types.
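As a concrete illustration of how such a context representation could be assembled, the sketch below concatenates one embedding per syntactic context type, zero-padding empty slots; the context types, dictionary layout, and helper names are hypothetical, not the paper's implementation.

```python
import numpy as np

n = 64                                              # embedding dimension (assumption)
context_types = ["subject", "object", "modifier"]   # hypothetical context slots, so k = 3
k = len(context_types)

def g(context_words, embeddings):
    """Map a predicate's syntactic context to a vector in R^(n*k).

    context_words: dict from context type to the word filling it (or None).
    embeddings:    dict from word to its n-dimensional embedding.
    """
    parts = []
    for ctype in context_types:
        word = context_words.get(ctype)
        parts.append(embeddings.get(word, np.zeros(n)) if word else np.zeros(n))
    return np.concatenate(parts)                    # shape (n * k,)

# Example for the predicate "runs" in "He runs the company".
emb = {"He": np.full(n, 0.1), "company": np.full(n, 0.2)}
vec = g({"subject": "He", "object": "company", "modifier": None}, emb)
assert vec.shape == (n * k,)
```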

Frame Identification with Embeddings

So for example “He runs the company” could help the model disambiguate “He owns the company.” Moreover, since g(x) relies on word embeddings rather than word identities, information is shared between words.

Overview

We present a model that takes word embeddings as input and learns to identify semantic frames.

Overview

We use word embeddings to represent the syntactic context of a particular predicate instance as a vector.

Hellinger PCA embeddings learnt using the framework show competitive results on empirical tasks.

Introduction

While word embeddings and language models from such methods have been useful for tasks such as relation classification, polarity detection, event coreference, and parsing, much of the existing literature on composition is based on abstract linguistic theory and conjecture, and there is little evidence that learnt representations for larger linguistic units correspond to their semantic meanings.

Introduction

While this framework is attractive in that it makes few assumptions about representation, the use of distributional embeddings for individual tokens means

Introduction

Recent work (Lebret and Lebret, 2013) has shown that the Hellinger distance is an especially effective measure for learning distributional embeddings, with Hellinger PCA being far less computationally expensive than neural language modeling approaches, while performing much better than standard PCA and remaining competitive with the state of the art in downstream evaluations.
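A minimal sketch of the Hellinger PCA idea, under the assumption that we start from a word-by-context co-occurrence count matrix: rows are normalised to probability distributions, the element-wise square root is taken (so that Euclidean distance between rows corresponds to Hellinger distance), and the result is reduced with PCA via an SVD. Matrix sizes and the truncation rank are illustrative.

```python
import numpy as np

def hellinger_pca_embeddings(counts, dim=50):
    """counts: (vocab, contexts) co-occurrence count matrix.
    Returns (vocab, dim) embeddings from PCA on square-rooted row distributions."""
    probs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    roots = np.sqrt(probs)                      # Euclidean distance here corresponds to Hellinger distance
    roots -= roots.mean(axis=0, keepdims=True)  # centre the data before PCA
    U, S, _ = np.linalg.svd(roots, full_matrices=False)
    return U[:, :dim] * S[:dim]                 # projections onto the top principal components

rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 20, size=(1000, 300)).astype(float)
E = hellinger_pca_embeddings(toy_counts, dim=50)  # shape (1000, 50)
```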

These generally consist of a projection layer that maps words, sub-word units, or n-grams to high-dimensional embeddings; the latter are then combined component-wise with an operation such as summation.
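For instance, the projection-plus-summation composition described here can be sketched as follows; the toy vocabulary, embedding size, and function name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}
E = rng.normal(size=(len(vocab), 8))   # projection layer: each word maps to an 8-dim embedding

def compose(tokens):
    """Project each token to its embedding and combine component-wise by summation."""
    return sum(E[vocab[t]] for t in tokens)

sentence_vec = compose(["the", "cat", "sat"])  # a single 8-dimensional representation
```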

Convolutional Neural Networks with Dynamic k-Max Pooling

Word embeddings have size d = 4.

Convolutional Neural Networks with Dynamic k-Max Pooling

The values in the embeddings w_i are parameters that are optimised during training.

Experiments

The set of parameters comprises the word embeddings, the filter weights, and the weights from the fully connected layers.

Experiments

As the dataset is rather small, we use lower-dimensional word vectors with d = 32 that are initialised with embeddings trained in an unsupervised way to predict contexts of occurrence (Turian et al., 2010).
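One common way to realise this kind of initialisation is sketched below, assuming the pre-trained vectors are available as a word-to-vector dictionary; the function name, ranges, and fallback scheme are hypothetical, not the paper's setup.

```python
import numpy as np

def init_embedding_matrix(vocab, pretrained, d=32, seed=0):
    """Build a (len(vocab), d) embedding matrix, copying pre-trained vectors
    where available and keeping small random values for the rest."""
    rng = np.random.default_rng(seed)
    E = rng.uniform(-0.05, 0.05, size=(len(vocab), d))
    for word, idx in vocab.items():
        if word in pretrained:
            E[idx] = pretrained[word][:d]  # assumes pre-trained vectors have at least d dimensions
    return E
```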

Experiments

The randomly initialised word embeddings are increased in length to a dimension of d = 60.