Topics

LDA

We do this by inferring topics in the training corpus, estimating the latent Dirichlet allocation (LDA) model (Blei et al., 2003).

Page 4, “Structured Regularizers for Text”

Note that LDA is an unsupervised method, so we can infer topical structures from any collection of documents that are considered related to the target corpus (e.g., training documents, text from the web, etc.).

Page 4, “Structured Regularizers for Text”

In our experiments, we choose the R most probable words given a topic and create a group for them.

Page 4, “Structured Regularizers for Text”
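The per-topic grouping can be sketched directly from a topic–word probability matrix (the matrix, vocabulary, and R below are toy values, not the paper's settings):

```python
import numpy as np

# Hypothetical topic-word matrix: rows are topics, columns are vocabulary
# entries (probability of each word given the topic).
vocab = ["market", "stock", "price", "game", "team", "coach"]
topic_word = np.array([
    [0.35, 0.30, 0.20, 0.05, 0.05, 0.05],   # a "finance"-like topic
    [0.05, 0.05, 0.05, 0.35, 0.30, 0.20],   # a "sports"-like topic
])

R = 3
# One group per topic: the R most probable words under that topic.
groups = [[vocab[i] for i in np.argsort(row)[::-1][:R]] for row in topic_word]
print(groups)  # [['market', 'stock', 'price'], ['game', 'team', 'coach']]
```

In practice the matrix would come from an LDA model estimated on the training corpus (or any related document collection, per the note above).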

The LDA regularizer will construct four groups from these topics.

Page 4, “Structured Regularizers for Text”

Unlike the parse tree regularizer, the LDA regularizer is not tree structured.

parse tree

We introduce three linguistically motivated structured regularizers based on parse trees, topics, and hierarchical word clusters for text categorization.

Page 1, “Abstract”

Figure 1: An example of a parse tree from the Stanford sentiment treebank, which annotates sentiment at the level of every constituent (indicated here by −− and ++; no marking indicates neutral sentiment).

We introduce a new regularizer, the parse tree regularizer, in which groups are defined for every constituent in every parse of a training data sentence.

Page 3, “Structured Regularizers for Text”

coefficients λ_g and λ) for one sentence with the parse tree shown in Figure 1 is: Ω_tree = Σ_g λ_g ‖w_g‖_2

Page 3, “Structured Regularizers for Text”
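A minimal sketch of the construction: one group per constituent (including single-word constituents, which contribute lasso-like terms), with the penalty summing λ_g‖w_g‖_2 over groups. The toy parse, weights, and uniform λ below are illustrative, not the paper's settings:

```python
import math

def collect_groups(node, groups):
    """Collect one group (a list of word indices) per constituent of `node`,
    including single-word constituents, which act like lasso terms."""
    if isinstance(node, int):                  # leaf = one word token
        leaves = [node]
    else:                                      # internal constituent
        leaves = [i for child in node for i in collect_groups(child, groups)]
    groups.append(leaves)
    return leaves

# A toy parse of "a (very good) movie", word indices 0..3.
tree = (0, ((1, 2), 3))
groups = []
collect_groups(tree, groups)

lam = 0.1                                      # one shared group weight
w = [0.5, -1.0, 2.0, 0.0]                      # toy per-word weights
omega = sum(lam * math.sqrt(sum(w[i] ** 2 for i in g)) for g in groups)
print(len(groups), round(omega, 4))
```

Note that the last group collected covers the whole sentence, and each leaf forms its own singleton group, which is why the lasso penalty appears "naturally" inside the tree regularizer.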

Of course, in a corpus there are many parse trees (one per sentence, so the number of parse trees is the number of sentences).

Page 3, “Structured Regularizers for Text”

Note that, since each word token is itself a constituent, the parse tree regularizer includes terms just like the lasso naturally, penalizing the absolute value of each word’s weight in isolation.

Page 3, “Structured Regularizers for Text”

Of course, in some sentences, some words will occur more than once, and the parse tree regularizer instantiates groups for constituents in every sentence in the training corpus, and these groups may work against each other.

Page 3, “Structured Regularizers for Text”

The parse tree regularizer should therefore

Page 3, “Structured Regularizers for Text”

In sentence level prediction tasks, such as sentence-level sentiment analysis, it is known that most constituents (especially those that correspond to shorter phrases) in a parse tree are uninformative (neutral sentiment).

sentiment analysis

We show that our structured regularizers consistently improve classification accuracies compared to standard regularizers that penalize features in isolation (such as lasso, ridge, and elastic net regularizers) on a range of datasets for various text prediction problems: topic classification, sentiment analysis, and forecasting.

Page 1, “Abstract”
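For contrast, the unstructured baselines penalize each weight in isolation. A toy sketch of the three penalties (the λ value and elastic-net mixing weight are illustrative, and the mixing parameterization is one common convention, not necessarily the paper's):

```python
w = [0.5, -1.0, 0.0, 2.0]   # toy per-word weights
lam = 0.1

lasso = lam * sum(abs(x) for x in w)        # l1: drives weights to exactly 0
ridge = lam * sum(x * x for x in w)         # squared l2: shrinks weights
alpha = 0.5                                 # elastic-net mixing weight
enet = alpha * lasso + (1 - alpha) * ridge  # convex combination of the two

print(round(lasso, 4), round(ridge, 4), round(enet, 4))
```

None of these penalties can express "these words behave as a unit," which is exactly what the structured (group) regularizers add.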

For tasks like text classification, sentiment analysis, and text-driven forecasting, this is an open question, as cheap “bag-of-words” models often perform well.

Page 1, “Introduction”

In sentence level prediction tasks, such as sentence-level sentiment analysis, it is known that most constituents (especially those that correspond to shorter phrases) in a parse tree are uninformative (neutral sentiment).

Page 4, “Structured Regularizers for Text”

Sentiment analysis.

Page 6, “Experiments”

One task in sentiment analysis is predicting the polarity of a piece of text, i.e., whether the author is favorably inclined toward a (usually known) subject of discussion or proposition (Pang and Lee, 2008).

Page 6, “Experiments”

Sentiment analysis, even at the coarse level of polarity we consider here, can be confused by negation, stylistic use of irony, and other linguistic phenomena.

For the Brown cluster regularizers, we ran Brown clustering on training documents with 5,000 clusters for the topic classification and sentiment analysis datasets, and 1,000 for the larger text forecasting datasets (since these larger datasets take more time to cluster).
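Groups can be read off a Brown clustering's bit-string paths: each prefix of a word's path names an internal node of the cluster hierarchy, and the words sharing that prefix form one group. The path table below is a made-up illustration, not real clustering output:

```python
from collections import defaultdict

# Hypothetical Brown-cluster paths: each word gets a binary path in the
# merge hierarchy (similar words share long prefixes).
paths = {
    "good":  "000",
    "great": "001",
    "bad":   "010",
    "awful": "011",
    "movie": "10",
    "film":  "11",
}

# One group per bit-string prefix, i.e., per internal node of the tree.
groups = defaultdict(set)
for word, path in paths.items():
    for k in range(1, len(path) + 1):
        groups[path[:k]].add(word)

print(sorted(groups["0"]))   # every word under the "0" subtree
print(sorted(groups["01"]))  # the tighter {bad, awful} cluster
```

Nested prefixes yield nested groups, so the resulting regularizer is tree-structured, like the parse tree regularizer but over word types rather than tokens.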

We empirically showed that models regularized using our methods consistently outperformed standard regularizers that penalize features in isolation (lasso, ridge, and elastic net) on a range of datasets for various text prediction problems: topic classification, sentiment analysis, and forecasting.

For tasks like text classification, sentiment analysis, and text-driven forecasting, this is an open question, as cheap “bag-of-words” models often perform well.

Page 1, “Introduction”

We embrace the conventional bag-of-words representation of text, instead bringing linguistic bias to bear on regularization.

Page 1, “Introduction”

Our experiments demonstrate that structured regularizers can squeeze higher performance out of conventional bag-of-words models on seven of the eight text categorization tasks tested, in six cases with more compact models than the best-performing unstructured-regularized model.

Page 1, “Introduction”

Overall, our results demonstrate that linguistic structure in the data can be used to improve bag-of-words models, through structured regularization.

Page 9, “Related and Future Work”

Our experimental focus has been on a controlled comparison between regularizers for a fixed model family (the simplest available, linear with bag-of-words features).

hyperparameter

Appears in 6 sentences as: hyperparameter (6)

In Linguistic Structured Sparsity in Text Categorization

Both methods disprefer weights of large magnitude; smaller (relative) magnitude means a feature (here, a word) has a smaller effect on the prediction, and zero means a feature has no effect. The hyperparameter λ in each case is typically tuned on a development dataset.

Page 2, “Notation”
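The tuning itself is just grid search against development-set error. A toy one-dimensional ridge example, purely illustrative of the loop (the data and λ grid are made up):

```python
# Toy (x, y) pairs for training and development.
train = [(1.0, 1.5), (2.0, 3.1), (3.0, 4.4)]
dev   = [(2.0, 2.4), (4.0, 4.8)]

def fit_ridge(data, lam):
    # Closed form for y ~ w*x with an l2 penalty: w = sum(xy) / (sum(x^2) + lam).
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def dev_error(w):
    return sum((w * x - y) ** 2 for x, y in dev)

# Pick the lambda whose fitted model does best on the development data.
best_lam = min([0.01, 0.1, 1.0, 10.0],
               key=lambda l: dev_error(fit_ridge(train, l)))
print(best_lam)
```

Here the development data prefers a shrunk slope, so an intermediate λ wins over both extremes, which is the behavior the tuning step is meant to find.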

where λ_glas is a hyperparameter tuned on development data, and λ_g is a group-specific weight.

Page 2, “Group Lasso”

As a result, besides λ_glas, we have an additional hyperparameter, denoted λ_las.

Page 3, “Structured Regularizers for Text”

Since the lasso-like penalty does not occur naturally in a non-tree-structured regularizer, we add an additional lasso penalty for each word type (with hyperparameter λ_las) to also encourage weights of irrelevant words to go to zero.

Page 4, “Structured Regularizers for Text”
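A sketch of the resulting combined penalty: a group-lasso term over the (possibly overlapping) groups plus an explicit per-word lasso term. The groups, weights, and both λ values below are illustrative:

```python
import math

w = [0.5, -1.0, 2.0, 0.0, 0.3]       # toy per-word weights
groups = [[0, 1], [1, 2], [2, 4]]    # overlapping word-index groups (non-tree)
lam_glas, lam_las = 0.1, 0.05        # illustrative hyperparameter values

# Omega = lam_glas * sum_g ||w_g||_2  +  lam_las * ||w||_1
group_term = lam_glas * sum(math.sqrt(sum(w[i] ** 2 for i in g)) for g in groups)
lasso_term = lam_las * sum(abs(x) for x in w)
omega = group_term + lasso_term
print(round(omega, 4))
```

The explicit lasso term plays the role that singleton constituent groups play in the tree-structured case, pushing individually irrelevant words to exactly zero.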

Similar to the parse tree regularizer, for the lasso-like penalty on each word, we tune one group weight for all word types on a development data with a hyperparameter Alas.

treebank

Appears in 6 sentences as: treebank (6)

In Linguistic Structured Sparsity in Text Categorization

Figure 1: An example of a parse tree from the Stanford sentiment treebank, which annotates sentiment at the level of every constituent (indicated here by −− and ++; no marking indicates neutral sentiment).

Page 3, “Structured Regularizers for Text”

The Stanford sentiment treebank has an annotation of sentiments at the constituent level.

Page 3, “Structured Regularizers for Text”

Figure 1 illustrates the group structures derived from an example sentence from the Stanford sentiment treebank (Socher et al., 2013).

Page 3, “Structured Regularizers for Text”

(2013) when annotating phrases in a sentence for building the Stanford sentiment treebank.

sentence-level

This regularizer captures the idea that phrases might be selected as relevant or (in most cases) irrelevant to a task, and is expected to be especially useful in sentence-level prediction tasks.

Page 3, “Structured Regularizers for Text”

In sentence level prediction tasks, such as sentence-level sentiment analysis, it is known that most constituents (especially those that correspond to shorter phrases) in a parse tree are uninformative (neutral sentiment).

Page 4, “Structured Regularizers for Text”

The task is to predict sentence-level sentiment, so each training example is a sentence.

Page 7, “Experiments”

It has been shown that syntactic information is helpful for sentence-level predictions (Socher et al., 2013), so the parse tree regularizer is naturally suitable for this task.

We leave comparison with other semi-supervised methods for future work.

Page 4, “Structured Regularizers for Text”

Note that we ran Brown clustering only on the training documents; running it on a larger collection of (unlabeled) documents relevant to the prediction task (i.e., semi-supervised learning) is worth exploring in future work.