Thursday, March 31, 2011

A large part of today's meeting was about Labeled LDA. The model was compared to standard LDA and Supervised LDA (http://www.cs.princeton.edu/~blei/papers/BleiMcAuliffe2007.pdf). It was mentioned that Supervised LDA is theoretically very nice, but in practice it is often hard to get working. For Labeled LDA, it was noted that the inference for the labels of test documents was not worked out well (they had simplified it to standard LDA inference).

We also discussed the paper on discovering dialogue acts in Twitter. Overall, it wasn't clear what the goal was or how they defined dialogue acts, and therefore the evaluation was not very satisfying. We weren't sure what evaluation method would fit better.

We also discussed the overall popularity of looking at Twitter data nowadays. One of the main advantages is that it's easy to get (Twitter's API seems to be pretty good). Also, its language is less formal than, for example, the WSJ, so it could be more interesting if you're after more natural language.

The focus paper for this week, Characterizing Microblogs with Topic Models, uses a variation of LDA, Labeled LDA, to analyze the topics present in Twitter, and then uses these for the tasks of rating tweets and recommending users to follow. They construct several datasets for this purpose: a small corpus of tweets labeled with topics and a set of tweets rated by users that received them. Overall, the features extracted from their topic model significantly improve performance on both tasks.

Only three different related papers were read this week: a paper about modeling conversations in twitter, a paper about Labeled LDA, and a paper about analyzing twitter discourse.

Dhananjay and Alan read "Unsupervised Modeling of Twitter Conversations", Alan Ritter, Colin Cherry, Bill Dolan, NAACL 2010. This paper discusses unsupervised methods for analyzing dialog acts in sequences of tweets. They present three methods, the EM Conversation model, the Conversation+Topic model, and the Bayesian Conversation model, where the Conversation+Topic model was their principal contribution. They perform several types of evaluation: qualitative analysis of the output of the system, held-out likelihood, and a new task, conversation ordering, which consists of reordering a scrambled conversation.

Weisi and Dong read "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora", Daniel Ramage et al, EMNLP 2009. This paper presents the modified form of LDA that is used in the focus paper. This model allows some documents to be specified with labels, which constrains the latent topics that are allowed to be associated with that document. They evaluate the model on many different tasks, comparing to one-vs-rest SVM, although Weisi notes that comparing to vanilla LDA would have been good in order to understand the quantitative effects of adding the labels.

I read "Beyond Microblogging: Conversation and Collaboration via Twitter", Honeycutt, C. and Herring, S., HICSS 2009. This paper discusses and analyzes dialog in Twitter, particularly relating to the use of the @ symbol. The authors sampled 37k tweets over the course of a day, and hand-annotated a sample of these in order to answer questions about the presence and function of dialog on Twitter, and the relation of the @ symbol to dialog. They found that dialogs are fairly prevalent on Twitter, and tended to be medium-length exchanges between two people. Further, the use of the @ symbol was strongly related to discourse acts involving interaction with other people.

Wednesday, March 30, 2011

The paper proposes an unsupervised approach to the task of modeling dialogue acts. It is based on the conversation model of Barzilay and Lee (2004), which uses HMMs for topic transitions, with each topic generating a message. Another latent variable, the source, is added to determine what generates a particular word. A source may be one of (1) the current post's dialogue act, (2) the conversation's topic, and (3) general English. Evaluation is done by comparing the probability of the held-out test data using Chib's estimator. Training is performed on a set of 10,000 randomly sampled conversations with 3-6 posts. An interpretation of the model based on the 10-act dialogue model is presented. I observed that their interpretation presents a transition graph which is acyclic (excluding the self loops). This could be because of the length of the conversations, and also because transitions with probability less than 0.1 are not shown. These points are, however, not discussed. A comparison metric based on the ordering of the posts is proposed: the posts of a particular conversation are permuted and the Kendall coefficient is calculated.
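The Kendall coefficient for a permuted conversation is easy to sketch: it just compares how many post pairs keep their original relative order against how many are flipped. A minimal illustration (the function name and the representation of a permutation as a list of original positions are my own, not the paper's):

```python
from itertools import combinations

def kendall_tau(order):
    """Kendall's tau between a permutation and the identity ordering.

    `order` holds the original positions of the posts in their permuted
    order, e.g. [2, 0, 1] means the third post now comes first.
    """
    n = len(order)
    # A pair is concordant if its relative order matches the original.
    concordant = sum(1 for a, b in combinations(order, 2) if a < b)
    discordant = n * (n - 1) // 2 - concordant
    return (concordant - discordant) / (n * (n - 1) / 2)

# A perfectly ordered conversation scores 1, a reversed one scores -1.
print(kendall_tau([0, 1, 2, 3]))
print(kendall_tau([3, 2, 1, 0]))
```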

The related paper I read this week proposes an unsupervised method for discovering dialogue structure, or "dialogue acts", in Twitter conversations. The idea is to automatically extract information about the nature of interactions between people in new media such as Twitter. This is a pretty cool problem, since the conversational aspects of English (or any language) pose some of the harder problems in NLP. The authors crawled Twitter using its API and obtained the posts of a sample of users, plus all the replies to their posts, extracting entire conversation trees. The data amounted to 1.3 million conversations. Only 10,000 random conversations are used, and scaling the models to the entire corpus is left for future work. The authors introduce three models, the EM Conversation model, the Conversation+Topic model, and the Bayesian Conversation model, the second being an extension of the first.

The Conversation+Topic model is basically a modified HMM borrowed from previous work on multi-document summarization. Using the Conversation model alone wasn't good enough, since topic and dialogue structure were mixed in the results while the focus is on dialogue structure. They use an LDA framework to modify the model to account for topic, and thus separate content words from dialogue indicators. For the inference engine, the HMM dynamic program is swapped for Gibbs sampling; slice sampling is also applied.

To evaluate the set of generated dialogue acts, the authors use both qualitative and quantitative evaluations. The qualitative evaluation focuses only on the Conversation+Topic model: they walk through a 10-act model, showing the probabilities on the transitions between dialogue acts. Most of the acts are reasonable and do a good job of illustrating the fact that Twitter is a microblog. They also display word lists and example posts for each dialogue act, which are fairly convincing. For the quantitative evaluation, the paper introduces the new task of conversation ordering: given a random set of conversations, all permutations of each conversation are generated and the probability of each permutation is evaluated as if it were an unseen conversation. It appears that, although useful, this metric does not directly imply anything about the interpretability of the model.

The paper seems to do a decent job of taking a first step toward unsupervised dialogue act tagging. The work doesn't seem overly complex, but the observations and the data they collected seem like they could be useful for at least a couple more studies.

The paper that I read presents an analysis of dialog on Twitter, and specifically the use of the @ symbol. The authors analyze 37k tweets sampled from four 1-hour periods in order to answer four primary questions: (1) What is the language break-up of tweets over time, and do different languages use the @ symbol to different degrees? (2) How do English tweets use the @ symbol? (3) What topics are present in tweets, and how does the presence of the @ symbol affect this distribution? (4) How does the @ symbol function with regard to interactive exchanges? In answer to the first question, they found that the use of the @ symbol does not vary significantly over time or language. They categorized the use of '@' into a variety of categories: addresses, references, and several uses not related to the @username construction. Of the instances of '@' in a sample of 1500 tweets, 90% were addresses, 5% were references, and the last 5% was distributed among the remaining categories. The authors devised a semantic coding scheme for classifying tweets into one of twelve categories, and manually tagged around 200 tweets. Overall, they found that tweets with '@' in them tended to be more interactive; they were more likely to address others directly, make requests of others, or provide information. To analyze dialog, the authors extracted all of the conversations that occurred in the sample of tweets, and found that the typical conversation was between two people, consisted of 3-5 tweets, and occurred over a period of 15-30 minutes. Further, they found that the use of the @ symbol was strongly tied to the presence of dialog.

Sunday, March 27, 2011

I have read the paper “Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora” by Daniel Ramage et al., which appeared in EMNLP 2009. This paper tackles the problem of how to make topics represent known labels, which is not addressed by standard LDA. Unaware of this paper, I recently came up with a similar model, in which I tied word senses to the topics so as to discover sense indicators for different senses. I have submitted a paper on this, and now I am not sure if I am going to get a rejection... This related paper is nice in that it conducts numerous experiments to show that the model is useful in certain situations. However, for certain experiments the setup is not very clear and readers have to figure it out by themselves; e.g., in the snippet extraction task, it might be that they first train on the same data and then test for the extraction. Also, in the Tagged Web Page task, they independently select 3000 docs for cross validation, and this suggests leakage of test data. Although the leakage is the same for both models, I feel it would be nice if they could clearly separate the training and testing data.

For the focus paper, I am not sure the way they calculate the Fleiss Kappa makes sense, because they suggest they are doing a one-vs-other comparison for each single category, and how they treat agreement on the “other” category is unknown. I feel it matters here how the probability of chance agreement in the Fleiss Kappa is calculated. The other thing is that it would be nice if they could present a comparison against standard LDA in the ranking experiments; this way we would be sure that the label information that L-LDA brings in makes a difference, because maybe standard LDA could achieve the same results.

The related paper introduced Labeled LDA, which is used as the main method in the focus paper. Labeled LDA defines a one-to-one correspondence between LDA's latent topics and user tags. In comparison with previous models such as Supervised LDA, this model allows documents to be associated with multiple labels. The inference is similar to that of the standard LDA model, except that the topics of a particular document are restricted to the topic set associated with the labels of that document. I liked the way they evaluated the model: they evaluated it in a range of tasks, namely topic visualization, snippet extraction, and multilabel text classification (compared with a strong baseline, one-vs-rest SVM).
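The restriction Labeled LDA places on inference is easy to see in a collapsed Gibbs sampling step: the topic for a word is drawn only from the document's label set rather than from all K topics. A rough sketch under my own naming conventions (`n_wt`, `n_dt`, `alpha`, `beta` are illustrative count structures and hyperparameters, not the paper's notation):

```python
import random

def sample_topic(word, doc_labels, n_wt, n_dt, doc_id, vocab_size,
                 alpha=0.1, beta=0.01):
    """Sample a topic for `word`, restricted to the document's label set.

    In Labeled LDA the candidate topics are exactly the document's labels;
    standard LDA would instead range over all K topics here.
    """
    weights = []
    for t in doc_labels:  # the only change vs. standard LDA
        # P(word | topic t), smoothed word-topic counts
        p_w = (n_wt[word][t] + beta) / (
            sum(n_wt[w][t] for w in n_wt) + vocab_size * beta)
        # Proportional to P(topic t | doc), smoothed doc-topic counts
        p_t = n_dt[doc_id][t] + alpha
        weights.append(p_w * p_t)
    # Draw from the unnormalized distribution over the label set
    r = random.random() * sum(weights)
    for t, w in zip(doc_labels, weights):
        r -= w
        if r <= 0:
            return t
    return doc_labels[-1]
```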

I'm not sure what to think of the focus paper. Much of the paper builds on the four dimensions: substance, style, status, and social. Although they seem to make sense, these were identified by interviewing a small and unrepresentative group. Also, the way they identified the labeled dimensions in Twitter and the mapping to these dimensions seem somewhat ad hoc, so I'm not sure what to think of their Twitter characterization. Because the ranking experiments don't use the 4S dimensions but only the topic distribution of the tweet, it would have been nice if they had also compared with standard LDA.

The meeting revolved mostly around evaluation of topic models. For most of the time, we discussed the role the focus paper played (or the objective of the focus paper). The method of using the first few lines as the summary of the topic was challenged. It was concluded that the reader should have access to the complete article, should it be wanted.

The behavior of the evaluation for models that generate n-grams was also put forward. It would be interesting to see how the word intrusion task performs in such cases; for example, consider a topic that rates high on a bigram like "New York" but low on "York" alone.

The final point of discussion was what role topic models play. It was argued that they are not yet useful as a pre-processing step or a decision-making criterion for other tasks. However, it was accepted that with the advent of topic models, research in generative graphical models has matured.

Friday, March 25, 2011

Yesterday's meeting focused mainly on ways of evaluating topic models. For the focus paper, the overall opinion was that the results weren't very strong. More details regarding the Mechanical Turk setup/results would have been nice, as well as a comparison with somewhat less similar models, such as LSA. Furthermore, it would have been nice if they had connected topic models more with human cognitive processes. During the meeting we also talked about calculating perplexity; there seem to be multiple ways of calculating this for topic models.

The author of my related paper seems to be working on presenting topics to humans. His recent publications include a lot of papers on topics such as visualization of topics, external evaluation, and topic labeling. Thus his website (http://www.ics.uci.edu/~newman/) might be a good starting point if you're interested in this.

A paper that was mentioned during the meeting was "Not-So-Latent Dirichlet Allocation: Collapsed Gibbs Sampling Using Human Judgments" by Jonathan Chang (http://www.aclweb.org/anthology/W/W10/W10-0720.pdf), where humans simulate the sampling step of Gibbs sampling and construct a topic model. I haven't read it yet, but it looks very interesting.

Thursday, March 24, 2011

The meeting today concentrated on issues involving the evaluation of topic models. The discussion mainly revolved around the focus paper, which presents tasks to demonstrate the semantics of the topics obtained from topic models. The inter-rater agreement, had it been reported, would have been nice, but as suggested by Noah, the statistical significance of the results is still valid. For automatic evaluation, Dong suggested the measures in his related paper do not address the issue strongly. Daniel read a paper about various statistical measures for topic models and suggested the comparison does not make enough sense. Alan read a paper about applying topic models to the polylingual case to help machine translation, and it was suggested that machine translation has not been helped much by topic models so far. Dipanjan asked about current work applying topic models to structured prediction, and my reading turned out to be a good fit, since the model is an integrated Bayesian model for WSD. Dhananjay read a paper about adapting PLSA over time. Brendan asked how exactly the perplexity on the test data is calculated, in other words how to fix the document distributions. People pointed out various points that are confusing in the focus paper (e.g., Alan pointed out that showing the topic results to the raters and then claiming there is no bias does not make sense) and could be improved. It would have been great if the results had turned out to meet people's expectations, such as CTM performing the best in the tasks and so on. It was suggested that a lot of work could be done to improve the focus paper.

The related paper addresses the changing nature of document collections. It tries to adapt the feature space and the underlying document model. The idea is to generate a PLSA model for every fixed-length time window. The paper provides an adaptive model for computing PLSA as a substitute for relearning from scratch for every window.

The PLSA model is parameterized by documents, words, and topics. For every time window, the words and documents of the previous time window are discarded and new words are introduced. EM is used for inference. The MAP estimates of the previous window serve as the initial estimates for the current window. For the current time, the documents that came before the window are discarded, and so is the vocabulary that is not present in the current window. New words are "folded in" and the model is inferred again.
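The "folding in" step can be sketched as EM with the topic-word distributions frozen: only the document's topic mixture is re-estimated. A minimal illustration, with names and the smoothing constant as my own assumptions:

```python
def fold_in(doc_counts, p_w_given_z, n_iters=50):
    """Fold a new document into a fixed PLSA model: keep P(w|z) frozen
    and run EM updates only on this document's topic mixture P(z|d)."""
    n_topics = len(p_w_given_z)
    p_z = [1.0 / n_topics] * n_topics      # uniform initialization
    for _ in range(n_iters):
        # E-step: expected topic counts per word, under fixed P(w|z)
        new = [0.0] * n_topics
        for w, cnt in doc_counts.items():
            denom = sum(p_z[z] * p_w_given_z[z].get(w, 1e-12)
                        for z in range(n_topics))
            for z in range(n_topics):
                new[z] += cnt * p_z[z] * p_w_given_z[z].get(w, 1e-12) / denom
        # M-step: renormalize to get the updated mixture
        total = sum(new)
        p_z = [v / total for v in new]
    return p_z
```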

For evaluation, the authors used the ACM-SIGIR conferences from 2000-2007. They compared their model with an independent PLSA retrained for each window. The difference was the initialization: while the adaptive model used the MAP estimates from the previous computation, the independent PLSA was initialized randomly. The comparison was carried out for two window sizes, 1 year and 2 years, using both the natural order and a random order of documents. The average perplexity (50 iterations over k = 1, 2, 4, ..., 128) for adaptive PLSA comes out about 5% lower than for the independent PLSA.
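For reference, perplexity is just the exponentiated negative average held-out log-likelihood per token, so a 5% reduction corresponds to a fairly modest gain in log-likelihood:

```python
import math

def perplexity(total_log_prob, n_tokens):
    """Perplexity from the total held-out log-likelihood (natural log)."""
    return math.exp(-total_log_prob / n_tokens)

# A model assigning each of 100 tokens probability 1/50 has perplexity 50.
print(perplexity(100 * math.log(1 / 50), 100))
```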

The related paper introduces a polylingual topic model (PLTM) that discovers topics aligned across multiple languages. They look at documents that are translated word-for-word via the EuroParl corpus, as well as documents that are not direct translations but are very likely to be about similar concepts (Wikipedia articles in various languages). The aim is to evaluate whether PLTM can accurately infer topics over direct translations, infer similarities between vocabularies in different languages, and detect differences in topic emphasis between languages.

PLTM is an extension of latent Dirichlet allocation (LDA) and topic assignments can be inferred using Gibbs sampling. The generalization ability of the model is based on the probability of previously unseen held-out document given posterior estimates. To evaluate the possibility of using PLTM for adapting machine translation systems, the authors also measure the ability of the model to align documents in one language with their translations in another language.

Because I have one meeting and one class consecutively Thursday afternoon before the meeting, I may not have enough time to cover posts submitted after 1:00pm. However, I will try my best to update and cover all the new posts in time.

For this week’s focus paper, people have read different papers regarding the interpretation and application of the topics generated by the topic models.

Daniel has read the paper “Evaluation Methods for Topic Models” by Wallach et al., ICML 2009. The paper explores various methods for evaluating LDA. The authors conduct the experiments in two settings: held-out documents, in which entire documents are held out, and document completion, in which only the latter half of each document is held out. The methods include harmonic mean sampling, annealed importance sampling, estimated theta, Chib-style estimation, and left-to-right evaluation. The evaluation methods are compared by checking which method assigns higher probabilities to the held-out data, as well as by their variance and computational complexity. The method for comparing evaluation methods does not seem well-justified, as noted by Daniel.

Dong read the paper “Automatic Evaluation of Topic Coherence” by Newman et al., NAACL 2010, which is an extension of the focus paper. The paper explores 15 different automatic measures utilizing WordNet, Wikipedia, Google, and a number of other external methods. Dong suggests that more error analysis could be given, and that it is not known how well the method generalizes across domains. The comparison of the ratings is also suggested to be not entirely convincing. About the focus paper, Dong suggests that sensitivity to the parameters might affect the strength of their results.

I myself read the paper “A Topic Model for Word Sense Disambiguation” by Jordan Boyd-Graber, David Blei, and Xiaojin Zhu, EMNLP 2007. The paper is not exactly about the evaluation of topic models, but explores an application of the semantic influence of the topics obtained from topic models. The WSD model combines P(topic | corpus) and P(sense | topic). For the second component, the authors use WordNet-Walk, which is a model over all the possible paths in WordNet to a specific target word. The model is elegant in the sense that its components are modular. However, it does not work very well because of the structure of WordNet. There are also issues with the estimation of P(sense | topic), where the path length is not paid enough attention. The advantage of the whole model is that it is modular and can be integrated into bigger models; however, how well the approximate inference would then work is not known. About the focus paper, I feel the authors could have provided the inter-rater agreement, because the raters are not very reliable.

Dhananjay read the paper “Topic Evolution in a Stream of Documents” by A. Gohr et al., SIAM Data Mining 2009. The paper discusses adapting topic models over time using PLSA. A PLSA model is obtained for each time period, and for each new time period the estimates from the previous period are used for initialization. Comparison with a randomly initialized PLSA model demonstrates an improvement of 5% in perplexity, as noted by Dhananjay.

Alan read the paper “Polylingual Topic Models” by Mimno et al., EMNLP 2009. The paper extends LDA to infer topics over direct translations, capture similarities between vocabularies in different languages, and detect differences in topic emphasis between languages. The evaluation is conducted on held-out data using likelihood. To evaluate the possibility of using it for adapting machine translation systems, the authors also measure the ability of the model to align documents in one language with their translations in another language, as noted by Alan.

This paper explores various intrinsic methods of evaluation for topic models, focusing on LDA. All of the methods are various ways of estimating the probability of held-out test documents given the model. The held-out documents are either entirely held out, or used in a document completion setting, where only the latter half of each test document is held out. For the completely held-out setting, they compare two commonly used methods, harmonic mean sampling and annealed importance sampling, to methods that had not previously been used for topic model evaluation, Chib-style estimation and left-to-right evaluation. For the document completion setting, they compare older methods, annealed importance sampling and estimated theta, to left-to-right evaluation. The methods are compared by seeing which assigns higher probabilities to the held-out data, as well as by looking at the variance and computational complexity of each method. The Chib-style estimation and left-to-right evaluation are determined to be the best. This method for comparing evaluation methods does not seem well-justified, and the authors do not go into any detail about why they used the procedures that they did.

Monday, March 21, 2011

I think the focus paper addressed an important topic. Topic models seem to be used a lot for visualization etc., thus it is important to be able to measure how interpretable a topic is. I liked the way they evaluated, although I'm wondering how sensitive their results are to the inference methods/parameters used.

My related paper essentially builds on the work by Chang et al. by proposing automatic methods to measure topic coherence/interpretability. They tried a bunch of external methods, using WordNet, Wikipedia, and Google (a total of 15 different measures). I like the range of methods they tried, but it would have been nice if they had said more regarding error analysis etc.; as it stands, most of the paper is about explaining all the different measures. They found that a Wikipedia-based method achieved very good results. They used annotators to rate the topics on a 3-point scale, and view the inter-annotator agreement as their upper bound. It seems a bit strange that some of the methods have an even higher score than that, and a lot of the methods are very close to the upper-bound score. So I'm wondering if their comparison with the upper bound (which they use to conclude that their methods perform really well) makes sense. In addition, I wonder how well their methods work when the domains are very specific and not well covered by the Web/WordNet/Wikipedia. It also wasn't clear to me how they mapped terms to Wikipedia pages, which is not a trivial thing to do.

Based on the concept of creating a metric space of clusterings, I wondered if there might be a way to use the motivation behind AdaBoost: an ensemble of multiple weak learners can be used to create a strong learner.

What if the dimensions of this metric space were a set of weak learners, and human observation were used to generate weights that create a decent clustering application? I don't have a particular example or application to illustrate this point, but perhaps somebody might want to help me out here?

Saturday, March 19, 2011

I have read “A Topic Model for Word Sense Disambiguation” by Jordan Boyd-Graber, David Blei, and Xiaojin Zhu, which appeared in EMNLP 2007. For the focus paper, one thing I feel is missing is the inter-rater agreement, especially in such a situation where there is a group of not highly reliable raters. The agreement could be calculated using measures such as Fleiss' Kappa. Another issue I feel is that they could also address the problem of semantically similar topics (topics that have similar distributions); this problem shows up when users cannot disambiguate the intruder from the intended topic words (as the authors suggest). In real semantic applications, people may wish to get rid of such semantically duplicated topics, and could do so if the similarity between two topics were known.
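Fleiss' Kappa is straightforward to compute from a subjects-by-categories matrix of rating counts; a small sketch (the function name and matrix layout are my own choices):

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a subjects x categories matrix of rating counts.

    ratings[i][j] = number of raters who put subject i into category j;
    every row must sum to the same number of raters.
    """
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    # Per-subject observed agreement P_i
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the marginal category proportions
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in ratings) / total
           for j in range(len(ratings[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With three raters in perfect agreement on two subjects, kappa is 1; when raters split evenly, kappa goes negative, reflecting worse-than-chance agreement.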

Since the authors suggested that task-specific evaluation measures should be preferred over perplexity, I decided to read the paper above. The Boyd-Graber et al. 2007 paper proposes a hierarchical Bayesian model that integrates two Bayesian models: LDA and WordNet-Walk. In a graphical model view, the model basically inserts into the LDA model a path node between the topic node and the word node. The state space of the node spans all the paths from the root synset to a specific synset that could generate the observed word in WordNet. The training is done through Gibbs sampling, and during inference, the lowest synset node of the path with maximum probability coming out of the assigned topic is assigned as the sense of the target word. The evaluation is done with different topic numbers, and the results turn out to be inferior to the state of the art. Besides the issues raised in the error analysis, I feel the model also does not address the path length issue: even if the parameters estimated for each edge on a path are large, if the path is much longer than the competing paths, the model will not be able to pick the correct sense confidently. Being Bayesian, the model can be integrated into a bigger model, as the authors claim; however, as with the Haghighi paper, the model itself is complex for inference, and when combined into a more complex model, the inference will be much more complex, and how good the approximation would be is unknown. In general, this paper provides an elegant generative model and shows that more knowledge and sense-specific features are needed in order to do better in WSD.

Wednesday, March 16, 2011

This paper provides a different approach to deriving the distance metric used to compare clusterings in the focus paper. The goal of this metric, called variation of information (VI), is to be intuitive as well as to possess desirable mathematical characteristics. From a variety of axioms, the following definition is derived: VI(C, C') = H(C|C') + H(C'|C), where H denotes the conditional entropy. This function is a true metric, which makes possible more types of reasoning about the space of clusterings. VI is not directly dependent on dataset size, so distances are comparable between different datasets. The author proves that VI is the only function that satisfies all the desired properties, although some of these properties are somewhat nonintuitive, relating to the way the metric interacts with combinations of clusters. The principal advantage of this metric over others is that it is comparable across datasets and experimental conditions without rescaling, which is generally not mathematically justified.
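Using the identity H(C|C') + H(C'|C) = 2H(C,C') - H(C) - H(C'), VI can be computed directly from two label lists; a small sketch (representation and names are my own):

```python
import math
from collections import Counter

def variation_of_information(c1, c2):
    """VI(C, C') = H(C|C') + H(C'|C), from two cluster-label lists."""
    n = len(c1)
    joint = Counter(zip(c1, c2))
    vi = 0.0
    # +2 * H(C, C') from the joint distribution
    for n_ab in joint.values():
        p_ab = n_ab / n
        vi -= 2 * p_ab * math.log(p_ab)
    # -H(C) and -H(C') from the marginals
    for counts in (Counter(c1), Counter(c2)):
        for cnt in counts.values():
            vi += (cnt / n) * math.log(cnt / n)
    return vi
```

Identical clusterings have VI 0; for two "crossed" 2-cluster partitions of four items, VI reaches 2 ln 2.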

This paper presented several interesting points on the topic. First, it sets up a formal definition of the combination clustering problem: given some set of clusterings for a set of data, find a clustering that is some combination of one or more of those clusterings and yields a "better" clustering. The paper formalizes this combination problem as a clustering problem itself, where the critical aspect is the consensus function that defines how different clusterings are combined. Then follows a proof relating this to a median partitioning problem and an intra-class variance criterion. Seven different consensus functions are studied, where evaluation is done by measuring the mis-assignment rate of the consensus partition (the true number of clusters is made available). A second part of the paper focuses on combining weak clustering algorithms, specifically clustering (via k-means) the data after projection into a lower dimension, or splitting the data randomly via hyperplanes. Experiments combine these with the consensus functions and are run in different dimensions for different values of k.

The idea in this paper seems pretty intuitive, which probably explains why it was published some time ago. Theoretically it makes sense that you should be able to form a more optimal clustering by combining multiple clusterings of a set of data, but because you could also telescope the argument, the practicality and marginal benefits seem limited in significance.

The paper evaluates various metrics for clustering. The authors list the following required criteria for the constraints used to evaluate a metric: (1) each constraint should address a limitation of some metric, (2) for any metric, there should be an analytical way to prove whether the metric satisfies the constraint, and (3) the constraint should discriminate between metric families.

Based on these criteria, the authors suggest four constraints: (1) Cluster homogeneity - a coarser cluster containing heterogeneous items should score lower than finer clusters of homogeneous items. (2) Completeness - homogeneous items should appear in the same cluster. (3) Rag bag - disorder in a noisy (heterogeneous) cluster is preferred to disorder in a homogeneous cluster. (4) Cluster size vs. quantity - a small error in a big cluster is preferred to a large number of small errors in small clusters.

Set-matching metrics fail on completeness and rag bag, as their bias is towards small clusters. Entropy-based methods generally fail on rag bag. Pair-counting metrics satisfy both homogeneity and completeness, but don't address rag bag or cluster size vs. quantity. The B-cubed family of metrics satisfies all four constraints.
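The B-cubed scores average, over items, the precision and recall of each item's predicted cluster against its gold cluster; a minimal sketch with flat clusterings given as label lists (representation mine):

```python
def bcubed(pred, gold):
    """B-cubed precision and recall for two flat clusterings (label lists)."""
    n = len(pred)
    precision = recall = 0.0
    for i in range(n):
        # Items sharing i's predicted cluster, and i's gold cluster
        same_pred = {j for j in range(n) if pred[j] == pred[i]}
        same_gold = {j for j in range(n) if gold[j] == gold[i]}
        correct = len(same_pred & same_gold)
        precision += correct / len(same_pred)
        recall += correct / len(same_gold)
    return precision / n, recall / n
```

For example, splitting one gold cluster into two keeps precision at 1 but halves recall.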

In the focus paper, the distance between two clusterings is the number of item pairs that the clusterings treat differently. It may seem that only homogeneity and completeness are addressed here. However, since none of the clusterings represents the true clustering, constraints (3) and (4) are probably not required.

Friday, March 11, 2011

I read the paper “Meta Clustering” by Rich Caruana et al., ICDM 2006. The paper describes meta-clustering, which is mentioned in the focus paper. It also has lots of similarities to the focus paper, and it seems the focus paper has taken some ideas from it, e.g., the similarity metric between clusterings. The meta-clustering procedure is decomposed into 3 steps: 1) generate the base-level clusterings; 2) define the similarity metric between the base-level clusterings; 3) conduct the meta-clustering on the base-level clusterings using some clustering method such as agglomerative clustering. The goal is to present a meta-clustering of the base-level clusterings to the user, so the user spends less time going through all the base-level clusterings to find the best one. For 1), random feature weights are assigned for clustering, and PCA is conducted to remove correlated features. For 2), the metric is the percentage of pairs of instances that are treated differently by the two clusterings (the pair being in one cluster in one clustering, and in two different clusters in the other). The evaluation is done on several datasets with user labels, calculating compactness and accuracy as performance measures. The main lesson from the experiments, as the authors suggest, is that when the correct clustering criterion is not specified in advance, searching for a single, optimal, compact clustering is not appropriate, since the correctness criterion might not correlate strongly with compactness. This also serves as one of the main motivations of this paper and the focus paper.
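The pairwise similarity metric from step 2 can be written in a few lines: count the fraction of item pairs on which the two clusterings disagree (a sketch; the representation as label lists and the function name are my own):

```python
from itertools import combinations

def pair_disagreement(c1, c2):
    """Fraction of item pairs that the two clusterings treat differently
    (together in one clustering, apart in the other)."""
    n = len(c1)
    disagree = sum(
        1 for i, j in combinations(range(n), 2)
        if (c1[i] == c1[j]) != (c2[i] == c2[j])
    )
    return disagree / (n * (n - 1) / 2)
```

Identical clusterings score 0; the score grows toward 1 as the clusterings disagree on more pairs.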

The related paper focuses on estimating proportions of categories (classes), instead of doing individual classifications (which is what is mostly done in computer science). The authors first review two existing approaches to estimating proportions: 1) sample a subset, hand-label it, and use it to estimate the category proportions; 2) do individual classification and aggregate the predictions into a proportion. They explain why both approaches have problems, and then propose two new methods. The first applies existing individual classification techniques, but then estimates the errors per category and corrects the aggregated category proportions. The second estimates proportions directly, without doing individual classification: the problem is framed as a regression problem in which the class proportions are the regression coefficients. Because of computational and sparsity issues, they sample subsets of words, estimate the proportions for each subset, and average the results.
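For the two-class case, the first method's error correction reduces to a one-line formula (a sketch of the general idea, not the paper's exact estimator): if the classifier's sensitivity and specificity are known, the observed positive rate satisfies observed = p·sens + (1 − p)·(1 − spec), which we can invert for p.

```python
def corrected_proportion(observed_pos_rate, sensitivity, specificity):
    """Correct a classifier's aggregated positive rate for its known
    per-class error rates (two-class case).  Solves
        observed = p * sens + (1 - p) * (1 - spec)
    for the true proportion p."""
    return (observed_pos_rate + specificity - 1) / (sensitivity + specificity - 1)
```

E.g. with sensitivity 0.9 and specificity 0.8, a true proportion of 0.3 yields an observed rate of 0.41; naive aggregation would report 0.41, while the correction recovers 0.3.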

I think the paper gave a nice explanation of the goals and alternatives. Furthermore, it was interesting because I'd never thought about estimating proportions instead of individual classifications, and now I know why you would want to use other methods instead of just aggregating individual classification predictions.

I will use this post-meeting commentary to try to clarify some things about the paper I read (Ye et al., COLING-ACL '06).

The paper presents and tests a hypothesis about what additional information human translators use to determine the tenses to be used during translation (in this case between Chinese and English). Since human translators still outperform current automated systems, the idea behind this is to identify where effort should be focused for advancing automatic extraction methods.

The experiment consists of training both conditional random fields (CRFs) and classification trees on surface features and latent features (alone and together), and evaluating accuracy against the tenses from gold-standard/best-ranked human-generated English translations. (Here we only consider whether the classifier can correctly predict the verb tense to use for the translation, so for Chinese to English, the classifier predicts which tense of the English verb is best suited.) The paper also details how these gold standards are generated from all the human annotations they collected.

In terms of surface features versus latent features: surface features are those that can be easily extracted, i.e. whether the verb is inside a quote, the presence of signal adverbs either before or after the verb, the presence of signal markers between two verbs, distance between characters, whether the verbs are in the same clause, etc. When the paper talks about latent features it addresses mainly three. One is telicity, which specifies whether the verb can be bound within a certain time frame. Another is punctuality, which says whether a verb can be associated with a single point event. The third is temporal ordering, which describes one of six relationships between two events, namely precession, succession, inclusion, subsumption, overlap, and none. The idea is that latent features require deeper semantic analysis of the source text that is generally only feasible through human processing.

So in the end, the results simply show that classifiers trained only on latent features outperform those trained only on surface features, which supports their claim that if we were able to achieve deeper semantic analysis, we would be able to improve our approach to this problem, and maybe to tense disambiguation problems in general.

In hindsight, maybe not the greatest of papers, but I thought I would try to clear some things up since it was pretty rough when I presented yesterday. It's very possible I may have misunderstood some part so please let me know if something seems off.

Because the focus paper proposes a new task, it was less clear what work was going to be relevant. People read papers about a diverse set of topics, including WSD, machine learning, and linguistics. Overall, people seemed to like the paper a lot; the main doubts were about the way the authors had sampled the data.

I think the work was interesting and I'm curious to see what other approaches people are going to take on this problem. For my research, we have been looking at posts from users over time on a forum. We were interested in the development of users over time. However, including all the text of a user together gave very noisy results, because it included a lot of text where users were talking about past experiences (for example, users often spoke in the present tense while telling a story about the past). It was very difficult to separate the current state of users from past/future states. A good way to disambiguate tense could definitely help.

For the focus paper, the authors came up with the idea of tackling a smaller semantic problem which has not been tackled before. They formulated this problem of Tense Disambiguation, which is related to WSD but not exactly the same. The potential impact of the problem is manifold: it can help many other problems such as machine translation, textual entailment, and so on.

For machine learning techniques, I read a paper that discusses a cascading model: the model is a pipeline of classifiers, where the output of the previous classifier is used to identify the next classifier, which then classifies the instance. Dong read a paper that explores combinations of classifiers to address WSD. The authors experimented with 6 different classifiers and different ways to combine them.

For WSD methods, besides the paper Dong read, Daniel read a paper that exploits novel syntactic features to improve the precision of WSD. The authors use minipar to generate dependencies to be used as features. For classifiers, they used Decision Lists and AdaBoost.

For temporal models, Dhananjay read a paper about ordering events based on temporal relations. It discusses why Integer Linear Programming is superior to greedy methods. The authors use an SVM to obtain pairwise orderings of temporal events, with transitivity enforced globally.

For linguistics, Brendan read a paper about construction grammars. The definition of a construction is discussed, as well as tree substitution grammars and decision-oriented parsing.

For translation, Alan read a paper on translating Chinese verbs to English, focusing on the difficulty of tense. The authors explore latent features (not exactly features involving hidden variables), such as temporal features, and show their effectiveness. Ablation experiments and comparisons with surface features are conducted.

The focus paper introduces a task called tense sense disambiguation, where given a concrete tense syntactic form in a sentence, the objective is to select the correct sense among a given set of possible senses.

As an experiment, an English syntactic sense dictionary was compiled and then used to annotate 3000 sentences from the British National Corpus. A supervised TSD learning algorithm was developed that uses basic, lexical, and POS-tag-based features. The algorithm also caters to the task structure by restricting the possible labels for each concrete syntactic form. The classifier outperforms the MFS baseline in all three conditions: when the ASF (abstract syntactic form) type is known, unknown, or given by a simple rule-based classifier.
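The label-restriction idea is easy to illustrate: instead of taking the argmax over all senses, the classifier only considers the senses licensed by the concrete syntactic form. A toy sketch (the interface and names are mine, not the paper's):

```python
def predict_sense(scores, allowed_senses):
    """Pick the highest-scoring sense among those licensed by the
    concrete syntactic form.

    scores: dict mapping every sense to a classifier score.
    allowed_senses: the subset of senses valid for this form.
    """
    candidates = {s: scores[s] for s in allowed_senses if s in scores}
    return max(candidates, key=candidates.get)
```

Restricting the candidate set this way can only help when the classifier's global top choice is a sense the syntactic form cannot realize.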

So for this week, people read the following supplemental papers:

Weisi read about a prototype model for coarse-to-fine learning with regards to multi-class classification. The model is a pipeline of independent classifiers trained at each level; since it does not exploit the overlap in the confusion lists on certain labels, it misses the possibility of a better estimation of the feature weights.

Dong read about experiments in combining classifiers using various techniques to improve performance with regards to WSD. Although the features or combinations themselves may not have been that unique, the end result of the paper showed improved performance from such an approach.

In somewhat of a contrast, Daniel read about novel syntactic features used to improve WSD performance. The paper uses supervised machine learning methods, decision lists, and AdaBoost, as well as added features for subcategorization frames and dependencies on words already present with some given sense. Thresholding was also used to trade recall for precision. In the end, the specific syntactic features along with AdaBoost performed the best, although AdaBoost has no effective thresholding mechanism.

Dhananjay read about imposing global constraints over local pairwise order decisions: one is transitivity and the other is time expression normalization. Transitivity uses Integer Linear Programming, while time expression normalization maps everything onto a single timeline. The results show a 1-2% absolute increase in accuracy.

Brendan read an overview of construction grammar. A construction is a pairing of a meaning and a form. Constructions require consideration of communicative functionality, meaning, and general cognitive constraints. There are some good examples illustrated in the link on Brendan's post. The relevance of the theory, and how one would use it or compare it in some applicable manner, still seems unclear.

A construction is a "pairing of form and meaning." Words, multiwords, morphological features, etc. are all constructions. To understand them you have to think about communicative functionality, meaning, and general cognitive constraints.

That's kind of it. Some examples:

It's a very NLP-friendly view of language. They're like all the little features you get in MT or other tasks.

It's kind of like lexical semantics, except for more than just words.

It's kind of dissatisfying, in that there's no theory of how words combine into meaning.

It's kind of not much of a theory. I don't see how it explains or makes testable hypotheses or predictions, or even how you would use it to help design a grammar or NLP features. I don't see how you could hope to compare it to LFG or CCG or HPSG or minimalism or what have you.

I see how it might have helped Reichart and Rappoport think about these things, or something. I dunno if I should bother reading the book though.

Wednesday, March 2, 2011

This week, I read Jointly Combining Implicit Constraints Improves Temporal Ordering by Nathanael Chambers and Dan Jurafsky. The paper proposes imposing global constraints over the local pairwise order decisions. It proposes two constraints: (1) transitivity and (2) time expression normalization.

Transitivity. A soft classification (before, after, and unknown) is done using an SVM, and confidence scores are calculated for each pair of events. The objective function is maximized subject to the constraints that exactly one of (before, after, unknown) is chosen per pair, with transitivity imposed over the connected components of a digraph via constraints of the form r_1 + r_2 - r_3 \leq 1 for each transitive triple. Integer Linear Programming is used to find the optimal solution. However, this found no advantage on the TimeBank corpus, as the graph was very sparse and there were very few connected components. To address this, they create a transitive closure.
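To see what the transitivity constraint buys, here is a brute-force sketch for just three events (names and the score format are mine; the paper uses ILP, which scales where enumeration does not):

```python
from itertools import product

RELS = ("before", "after", "unknown")

def consistent(r12, r23, r13):
    # Transitivity: before∘before → before, after∘after → after.
    if r12 == "before" and r23 == "before":
        return r13 == "before"
    if r12 == "after" and r23 == "after":
        return r13 == "after"
    return True

def best_ordering(scores):
    """scores[(pair, rel)] -> classifier confidence for events 1,2,3.
    Enumerate all joint assignments and keep the highest-scoring
    transitively consistent one."""
    best, best_val = None, float("-inf")
    for r12, r23, r13 in product(RELS, repeat=3):
        if not consistent(r12, r23, r13):
            continue
        val = scores[("12", r12)] + scores[("23", r23)] + scores[("13", r13)]
        if val > best_val:
            best, best_val = (r12, r23, r13), val
    return best
```

The interesting case is when the local classifier weakly prefers an inconsistent relation for one pair: the two confident decisions force the third into line, which is exactly the effect the global constraint is meant to have.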

The second global constraint was normalizing the time expressions to a single timeline. The document's publication date is taken to be the current point in time, and phrases like "last month" and "next Friday" are normalized relative to it.

After using time expression normalization as well as transitivity, the authors report a 1-2% (absolute) increase in accuracy.

This paper explores novel syntactic features for improving precision in Word Sense Disambiguation. The authors use existing supervised machine learning methods, Decision Lists and AdaBoost. In addition to the standard n-gram-type features already used in earlier WSD models, the authors add features for dependencies on specific words being present with a given word sense, as well as features for subcategorization frames. The authors then experimented with various types of thresholding in order to boost precision at the expense of recall.

With no thresholding, they found that the specific syntactic features helped precision dramatically, but the subcategorization features led to a higher F-score, with the combination of the two being slightly better than the subcategorization features alone. They further find that AdaBoost works significantly better than Decision Lists when syntactic features are taken into account, although not with the basic feature set; AdaBoost combined with syntactic features outperforms either method with only basic features. The authors found no effective thresholding mechanism for AdaBoost, but with Decision Lists, by only using features with high confidence, it was possible to prune the results to 95% precision at 7% recall.
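The precision/recall trade made with Decision Lists is just confidence thresholding: abstain on anything the top-ranked feature isn't confident about. A toy sketch (the data format is my assumption, not the paper's):

```python
def threshold_predictions(predictions, theta):
    """Keep only predictions whose confidence is at least theta.

    predictions: dict mapping instance -> (label, confidence).
    Abstaining on low-confidence instances lowers recall but raises
    precision, assuming confidence correlates with correctness.
    """
    return {x: label
            for x, (label, conf) in predictions.items()
            if conf >= theta}
```

Sweeping theta from 0 to 1 traces out the precision/recall curve; the 95% precision / 7% recall point above is one extreme of that sweep.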

To recap I read "Latent features in automatic tense translation between Chinese and English."

The paper is a very basic introduction to the difficulties inherent in translating Chinese verbs to English, particularly with regard to determining tense. The paper pushes the idea that the deeper features used by humans when translating need to be identified, and that they should be the focus of work on more advanced automatic extraction methods. As a method of comparison and evaluation, tense classifiers with latent features are demonstrated to perform better than those using only surface features. The paper seems to promote a more human-cognition-based approach to the problem and to NLP in general, targeting the information that currently remains difficult to extract.

I feel Chinese may have been chosen to demonstrate the more interesting or extreme case of the tense classification problem. The feature space explored includes surface features and latent features (telicity, punctuality, and temporal ordering features). The basic ideas were fairly comfortable to understand thanks to my relative fluency in Chinese, and I won't go into the details in this blog post.

The paper describes both CRF (conditional random field) learning experiments and classification tree learning experiments. The classifiers are trained on surface features and latent features separately and then together. Both yield similar accuracy and generally perform 15% better than the baseline systems. A large part of the paper seems aimed at reorienting focus toward extracting latent features; the authors leverage the difficulty of the task as the key to advancing current extraction methods.

The related paper, by Florian and Yarowsky, describes experiments with different methods of combining classifiers to improve performance on the word sense disambiguation task. They experimented with 6 different classifiers and different methods of combining them (weighted average of posterior probabilities, combination based on order statistics, and voting). Their features included bigrams, trigrams, bag-of-words, and syntactic features. They showed that large performance gains could be obtained by combining classifiers; their final approach ('stacking') was a combination of different combinations of classifiers. I think the main contribution of the paper is applying classifier combination to WSD. The features themselves do not seem very innovative, nor were they the focus. Furthermore, it's not clear how innovative the classifier combinations used were in general (not restricted to WSD).
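Two of the combination schemes mentioned are simple enough to sketch: a weighted average of posterior distributions, and plain voting (a toy version with names of my choosing, not Florian and Yarowsky's implementation):

```python
from collections import Counter

def combine_posteriors(posteriors, weights):
    """Weighted average of per-classifier posterior distributions.

    posteriors: list of dicts, each mapping label -> probability.
    weights: one weight per classifier.
    """
    labels = posteriors[0].keys()
    combined = {
        lab: sum(w * p[lab] for w, p in zip(weights, posteriors))
        for lab in labels
    }
    total = sum(combined.values())  # renormalize
    return {lab: v / total for lab, v in combined.items()}

def majority_vote(predictions):
    """Each classifier casts one vote; most common label wins."""
    return Counter(predictions).most_common(1)[0][0]
```

Averaging posteriors lets a very confident classifier override several lukewarm ones, whereas voting weighs each classifier equally regardless of confidence; stacking learns a meta-classifier over such combiner outputs.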

The focus paper was interesting, but because I'm not familiar with some of the approaches their work builds on (the sequential model, SNoW), some parts were not totally clear to me. Also, some of their defined senses seem a bit strange (e.g., 'when describing the content of a book' seems very specific).

I read the paper “A Sequential Model for Multi-Class Classification” by Yair Even-Zohar and Dan Roth. The paper discusses a prototype model for coarse-to-fine learning, but in the domain of multi-class classification instead of structured prediction. The model is a pipeline of classifiers covering different feature spaces and output spaces: the output space of the previous classifier is pruned by thresholding to generate the output space of the next classifier. This seems to suggest an explosion of classifiers if not well engineered. During training, each classifier is trained on the pruned output space of the previous classifier and on the instances relevant to that pruned output space. The authors provide a proof that a bigger output space induces more error and a smaller output space reduces the training error.

The idea gives the feeling that it basically trains a set of independent classifiers at each level, and does not take advantage of the overlap of the confusion lists on certain labels, which could give better estimates for the weights of the features f(x, overlapped_label). For example, in WSD, we could engineer the features to reflect only semantic correlation, not bound to the target words, in which case we would have one single classifier instead of one classifier per target word, and the features would be better estimated because the partition of the training examples enforced by the target words is removed. For the focus paper, I am curious about the methodology they used to generate the senses; we did not get to this due to time constraints.
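The prune-by-thresholding pipeline can be sketched in a few lines (a schematic of the general coarse-to-fine idea under my own interface, not the paper's actual model):

```python
def sequential_classify(x, scorers, labels, keep=0.5):
    """Coarse-to-fine pipeline: each stage scores only the labels that
    survived the previous stage, then prunes those scoring below
    `keep` times that stage's best score.

    scorers: list of functions (x, label) -> score, one per stage.
    labels: the initial (full) output space.
    """
    for score in scorers:
        vals = {lab: score(x, lab) for lab in labels}
        cutoff = keep * max(vals.values())
        labels = [lab for lab in labels if vals[lab] >= cutoff]
        if len(labels) == 1:  # output space fully resolved
            break
    return labels
```

Each stage works on a shrinking label set, which is exactly why the paper's bound on training error in terms of output-space size matters; the critique above is that the stages share no parameters across overlapping confusion lists.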