F1 Score

Evaluation Metric: We evaluated the results with both Mean Average Precision (MAP) and F1 Score.

Page 7, “Experiments”
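
A minimal sketch of how these two metrics can be computed per emotion class and then macro-averaged, using scikit-learn; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

def macro_map_and_f1(y_true, y_score, y_pred):
    """Macro-averaged MAP and F1 over emotion classes.

    y_true  -- binary gold labels, shape (n_tweets, n_emotions)
    y_score -- real-valued ranking scores (e.g., dot products or SVM outputs)
    y_pred  -- binary predictions, same shape as y_true
    """
    n_emotions = y_true.shape[1]
    # Average precision per emotion, then the mean over emotions (MAP).
    ap_per_emotion = [
        average_precision_score(y_true[:, c], y_score[:, c])
        for c in range(n_emotions)
    ]
    macro_map = float(np.mean(ap_per_emotion))
    # Macro-averaged F1 treats each emotion equally regardless of frequency.
    macro_f1 = f1_score(y_true, y_pred, average="macro")
    return macro_map, macro_f1
```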

(b) Macro-Averaged F1 Score

Page 7, “Experiments”

We reported both the macro-averaged MAP (Figure 1a) and the macro-averaged F1 Score (Figure 1b) on eight emotions as the overall performance of three competitive methods: Spread, SVM-Delta-IDF, and SVM-TF.

Page 8, “Experiments”

In comparison, SVM-Delta-IDF significantly outperforms SVM-TF with respect to both MAP and F1 Score.

Page 8, “Experiments”

SVM-TF achieves higher MAP and F1 Score than Spread in the first few iterations, but it is then overtaken by Spread after 16,500 tweets have been selected and re-annotated by the eighth iteration.

Page 8, “Experiments”

Overall, at the end of the active learning process, Spread outperforms SVM-TF by 3.03% in MAP (and by 4.29% in F1 Score), and SVM-Delta-IDF outperforms SVM-TF by 8.59% in MAP (and by 5.26% in F1 Score).

Page 8, “Experiments”

Spread achieves an F1 Score of 58.84%, which is quite competitive with the 59.82% achieved by SVM-Delta-IDF, though SVM-Delta-IDF outperforms Spread with respect to MAP.

feature weighting

We describe computationally cheap feature weighting techniques and a novel nonlinear distribution spreading algorithm that can be used to iteratively and interactively correct mislabeled instances, significantly improving annotation quality at low cost.

Page 1, “Abstract”

Following this idea, we develop computationally cheap feature weighting techniques to counteract this effect by boosting the weight of discriminative features, so that they are not subdued and instances containing them have a higher chance of being correctly classified.

The model’s ability to discriminate at the feature level can be further enhanced by leveraging the distribution of feature weights across multiple classes, e.g., multiple emotion categories such as funny, happy, sad, exciting, and boring.

Page 5, “Feature Weighting Methods”
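
The excerpts do not reproduce the paper's spreading function (Formula 3), so the sketch below is only a hypothetical illustration of the general idea: given per-emotion feature weights, it nonlinearly exaggerates how far a feature's weight for one emotion departs from its average weight across all emotions, boosting features that are discriminative for a particular class.

```python
import numpy as np

def spread_weights(class_weights, power=3):
    """Hypothetical nonlinear spreading (NOT the paper's Formula 3).

    class_weights -- array of shape (n_emotions, n_features), e.g. the
    per-emotion Delta IDF weights of every vocabulary term.
    """
    # How far each class's weight for a feature deviates from that feature's
    # mean weight across all classes.
    deviation = class_weights - class_weights.mean(axis=0, keepdims=True)
    # An odd power keeps the sign but exaggerates large deviations, so
    # class-discriminative features end up with boosted weights.
    return np.sign(deviation) * np.abs(deviation) ** power
```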

While these feature weighting models can be used to score and rank instances for data cleaning, better classification and regression models can be built by using the feature weights generated by these models as a pre-weight on the data points for other machine learning algorithms.

Page 5, “Feature Weighting Methods”

Spreading the feature weights reduces the number of data points that must be examined in order to correct the mislabeled instances.

Amazon Mechanical Turk

In this paper we study a large, low quality annotated dataset, created quickly and cheaply using Amazon Mechanical Turk to crowd-source annotations.

Page 1, “Abstract”

There are generally two ways to collect annotations of a dataset: through a few expert annotators, or through crowdsourcing services (e.g., Amazon’s Mechanical Turk).

Page 1, “Introduction”

We employ Amazon’s Mechanical Turk (AMT) to label the emotions of Twitter data, and apply the proposed methods to the AMT dataset with the goals of improving the annotation quality at low cost, as well as learning accurate emotion classifiers.

Page 2, “Introduction”

We then sent these tweets to Amazon Mechanical Turk for annotation.

Page 5, “Experiments”

In order to evaluate our approach in real-world scenarios, instead of creating a high quality annotated dataset and then introducing artificial noise, we followed the common practice of crowdsourcing and collected emotion annotations through Amazon Mechanical Turk (AMT).

Page 6, “Experiments”

Amazon Mechanical Turk Annotation: we posted the set of 100K tweets to the workers on AMT for emotion annotation.

SVM

(2012) propose an algorithm which first trains individual SVM classifiers on several small, class-balanced, random subsets of the dataset, and then reclassifies each training instance using a majority vote of these individual classifiers.

Page 3, “Related Work”
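
A compact sketch of that reclassification scheme; the number of classifiers, the subset size, and the use of scikit-learn's LinearSVC are illustrative assumptions rather than details taken from the cited work.

```python
import numpy as np
from sklearn.svm import LinearSVC

def majority_vote_relabel(X, y, n_classifiers=11, per_class=250, seed=0):
    """Train SVMs on small, class-balanced random subsets of (X, y), then
    reclassify every training instance by majority vote of those classifiers.
    Labels are assumed to be small non-negative integers."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    votes = []
    for _ in range(n_classifiers):
        # Draw the same number of instances from each class.
        idx = np.concatenate([
            rng.choice(np.where(y == c)[0], size=per_class, replace=False)
            for c in classes
        ])
        clf = LinearSVC().fit(X[idx], y[idx])
        votes.append(clf.predict(X))
    votes = np.stack(votes)                     # (n_classifiers, n_instances)
    # Majority vote per instance becomes its corrected label.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```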

Methods: We evaluated the overall performance relative to the common bag-of-words SVM approach that is ubiquitous in the text mining literature.

Page 7, “Experiments”

SVM-TF: Uses a bag-of-words SVM with term frequency weights.

Page 7, “Experiments”

SVM-Delta-IDF: Uses bag-of-words SVM classification with TF.Delta-IDF weights (Formula 2) in the feature vectors before training or testing the SVM.

Page 7, “Experiments”

We built the SVM classifiers using LIBLINEAR (Fan et al., 2008) and applied its L2-regularized support vector regression model.

Page 7, “Experiments”

Based on the dot product or SVM regression scores, we ranked the tweets by how strongly they express the emotion.
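
A minimal sketch of this setup using scikit-learn's LinearSVR, which wraps LIBLINEAR; the synthetic data and the C and epsilon values are placeholders, since the excerpts do not report the paper's settings.

```python
import numpy as np
from sklearn.svm import LinearSVR

# Placeholder data standing in for TF.Delta-IDF weighted bag-of-words vectors
# and per-emotion annotation scores (shapes only, not real tweets).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 50)), rng.random(200)
X_test = rng.random((20, 50))

# L2-regularized support vector regression via LIBLINEAR.
model = LinearSVR(C=1.0, epsilon=0.1, max_iter=10000)
model.fit(X_train, y_train)

# Rank tweets by how strongly the regression score says they express the emotion.
scores = model.predict(X_test)
ranking = np.argsort(-scores)   # tweet indices, strongest expression first
```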

weight vector

We calculate the Delta IDF score of every term in V, and get the Delta IDF weight vector Δ = (Δidf_1, ..., Δidf_|V|) for all terms.

Page 4, “Feature Weighting Methods”
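
Formula 1 itself is not reproduced in these excerpts; the sketch below uses a common Delta IDF formulation, in the spirit of Delta TF-IDF, that contrasts a term's document frequency in the positively and negatively labeled subsets, with add-one smoothing as an extra assumption.

```python
import numpy as np

def delta_idf(df_pos, df_neg, n_pos, n_neg):
    """Delta IDF score for every term in the vocabulary V.

    df_pos, df_neg -- arrays of length |V|: each term's document frequency in
    the positively / negatively labeled documents.
    n_pos, n_neg   -- number of documents in each subset.
    """
    idf_pos = np.log2((n_pos + 1) / (df_pos + 1))   # add-one smoothing (assumed)
    idf_neg = np.log2((n_neg + 1) / (df_neg + 1))
    # Terms common in positive documents but rare in negative ones receive
    # large positive weights, and vice versa.
    return idf_neg - idf_pos
```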

When the dataset is imbalanced, to avoid building a biased model, we down-sample the majority class before calculating the Delta IDF score and then use a bias balancing procedure to balance the Delta IDF weight vector.

Page 4, “Feature Weighting Methods”

This procedure first divides the Delta IDF weight vector into two vectors, one of which contains all the features with positive scores, and the other of which contains all the features with negative scores.

Page 4, “Feature Weighting Methods”

Let V_l be the vocabulary of dataset D_l, V be the vocabulary of all datasets, and |V| be the number of unique terms in V. Using Formula (1) and dataset D_l, we get the Delta IDF weight vector for each class l: Δ_l = (Δidf_1^l, ..., Δidf_|V|^l).

Page 5, “Feature Weighting Methods”

Delta-IDF: Takes the dot product of the Delta IDF weight vector (Formula 1) with the document’s term frequency vector.

Page 7, “Experiments”

Spread: Takes the dot product of the distribution spread weight vector (Formula 3) with the document’s term frequency vector.
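
Both scorers reduce to the same operation and differ only in which weight vector is used; a minimal sketch with illustrative names:

```python
import numpy as np

def score_documents(tf_matrix, weight_vector):
    """Dot product of each document's term-frequency vector with a feature
    weight vector (Delta IDF weights for Delta-IDF, spread weights for Spread)."""
    return tf_matrix @ weight_vector

def rank_by_emotion_strength(tf_matrix, weight_vector):
    """Document indices ordered from strongest to weakest expression of the emotion."""
    return np.argsort(-score_documents(tf_matrix, weight_vector))
```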

iteratively

We describe computationally cheap feature weighting techniques and a novel nonlinear distribution spreading algorithm that can be used to iteratively and interactively correct mislabeled instances, significantly improving annotation quality at low cost.

Page 1, “Abstract”

The process of selecting and relabeling data points can be conducted over multiple rounds to iteratively improve the data quality.

Page 1, “Introduction”

An active learner uses a small set of labeled data to iteratively select the most informative instances from a large pool of unlabeled data for human annotators to label (Settles, 2010).

Page 1, “Introduction”

In this work, we borrow the idea of active learning to interactively and iteratively correct labeling errors.

Page 1, “Introduction”

(2012) propose a solution called Active Label Correction (ALC) which iteratively presents the experts with small sets of suspected mislabeled instances at each round.
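
A hedged sketch of what one such round could look like: the labeled instances whose current labels disagree most with the model's scores are flagged and sent back for re-annotation. The disagreement criterion and batch size are illustrative assumptions, not the exact ALC procedure.

```python
import numpy as np

def select_suspects(scores, labels, batch_size=500):
    """Flag instances whose model score disagrees most with the current label:
    positives with the lowest scores, negatives with the highest scores."""
    disagreement = np.where(labels == 1, -scores, scores)
    return np.argsort(-disagreement)[:batch_size]

def correction_round(scores, labels, reannotate, batch_size=500):
    """One round: pick suspects, have annotators relabel them, update the labels."""
    suspects = select_suspects(scores, labels, batch_size)
    corrected = labels.copy()
    corrected[suspects] = reannotate(suspects)   # new crowd or expert labels
    return corrected
```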

labeled data

An active learner uses a small set of labeled data to iteratively select the most informative instances from a large pool of unlabeled data for human annotators to label (Settles, 2010).

Page 1, “Introduction”

In Active Learning (Settles, 2010) a small set of labeled data is used to find documents that should be annotated from a large pool of unlabeled documents.

Page 3, “Related Work”

For these reasons, there is a lack of sufficient, high-quality labeled data for emotion research.

Page 6, “Experiments”

Since in real world applications people are primarily concerned with how well the algorithm will work for new TV shows or movies that may not be included in the training data, we defined a test fold for each TV show or movie in our labeled data set.

Page 7, “Experiments”

Each test fold corresponded to a training fold containing all the labeled data from all the other TV shows and movies.

learning algorithms

Noise tolerance techniques aim to improve the learning algorithm itself to avoid over-fitting caused by mislabeled instances in the training phase, so that the constructed classifier becomes more noise-tolerant.

Page 2, “Related Work”

Decision trees (Mingers, 1989; Vannoorenberghe and Denoeux, 2002) and boosting (Jiang, 2001; Kalai and Servedio, 2005; Karmaker and Kwek, 2006) are two learning algorithms that have been investigated in many studies.

Page 2, “Related Work”

For example, useful information can be removed with noise elimination, since annotation errors are likely to occur on ambiguous instances that are potentially valuable for learning algorithms.

Page 2, “Related Work”

While these feature weighting models can be used to score and rank instances for data cleaning, better classification and regression models can be built by using the feature weights generated by these models as a pre-weight on the data points for other machine learning algorithms.