The automatic discovery of noteworthy patterns and trends in large document collections is the goal of our text mining projects. We are interested not only in identifying such patterns and trends but also in revealing them to human users through useful user interfaces. Our hypothesis is that meaningful patterns are best discovered through cooperation between automatic methods and human expertise. Our work involves models for document clustering and for topic modeling. The interplay between such models and interactive user interfaces is an area of current investigation.

* In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2013)

* Despite the popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections with relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topic analysis suffers from a lack of identifiability between topics, not only across independently analyzed corpora but also across distinct runs of the algorithm on the same corpus. This paper introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD), and Explicit Dirichlet Allocation (EDA). Both of these methods estimate topic-word distributions a priori from Wikipedia articles, with each article corresponding to one topic and the article title serving as a topic label. LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess their effectiveness by means of crowd-sourced user studies on two tasks: topic label generation and document label generation. We find that LDA-STWD improves substantially upon the performance of the state-of-the-art on the document labeling task, and that both methods otherwise perform on par with a state-of-the-art post hoc method.
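
The idea of fixing topic-word distributions ahead of time, one per labeled article, can be illustrated with a small sketch. This is not the LDA-STWD or EDA inference procedure: it assumes a single topic per document and picks the label by maximum log-likelihood. The toy articles, the `smoothing` value, and the function names are all hypothetical.

```python
import math
from collections import Counter

def topic_word_distributions(articles, smoothing=0.01):
    """Estimate one smoothed word distribution per article (topic).

    `articles` maps a title (used as the topic label) to its text.
    All distributions share the same vocabulary.
    """
    counts = {title: Counter(text.lower().split()) for title, text in articles.items()}
    vocab = sorted({w for c in counts.values() for w in c})
    dists = {}
    for title, c in counts.items():
        total = sum(c.values()) + smoothing * len(vocab)
        dists[title] = {w: (c[w] + smoothing) / total for w in vocab}
    return dists

def label_document(text, dists):
    """Return the topic label whose fixed distribution best explains the text.

    Simplifying assumption: one topic per document; out-of-vocabulary
    tokens are ignored.
    """
    tokens = text.lower().split()

    def loglik(dist):
        return sum(math.log(dist[w]) for w in tokens if w in dist)

    return max(dists, key=lambda title: loglik(dists[title]))

# Toy stand-ins for Wikipedia articles (hypothetical data).
articles = {
    "Baseball": "pitcher inning bat ball base run pitcher",
    "Astronomy": "star planet orbit telescope galaxy star",
}
dists = topic_word_distributions(articles)
label = label_document("the telescope tracked a planet in orbit", dists)
print(label)
```

Because the distributions are computed once and never re-estimated, the same label always refers to the same distribution, which is the property the abstract contrasts with ordinary LDA output.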

* We present computational models capable of understanding and conveying concepts based on word associations. We discover word associations automatically using corpus-based semantic models with Wikipedia as the corpus. The best model effectively combines corpus-based models with preexisting databases of free association norms gathered from human volunteers. We use this model to play human-directed and computer-directed word guessing games (games with a purpose similar to Catch Phrase or Taboo) and show that this model can measurably convey and understand some aspect of word meaning. The results highlight the fact that human-derived word associations and corpus-derived word associations can play complementary roles in semantic models.

* We introduce a new supervised topic model that uses a nonparametric density estimator to model the distribution of real-valued metadata given a topic. The model is similar to Topics Over Time, but replaces the beta distributions used in that model with a Dirichlet process mixture of normals. The use of a nonparametric density estimator allows for the fitting of a greater class of metadata densities. We compare our model with existing supervised topic models in terms of prediction and show that it is capable of discovering complex metadata distributions in both synthetic and real data.

* Proceedings of the 7th International Conference on Open Source Systems (OSS 2011)

* Large commits, which we refer to as "Cliff Walls", are a significant challenge to studies of software evolution because they do not appear to represent incremental development. We used Latent Dirichlet Allocation to extract topics from over 2 million commit log messages, taken from 10,000 SourceForge projects. The topics generated through this method were then analyzed to determine the causes of over 9,000 of the largest commits. We found that branch merges, code imports, and auto-generated documentation were significant causes of large commits.

* Proceedings of the Workshop on Challenges of Data Visualization (NIPS 2010)

* We present the Topical Guide (formerly "the Topic Browser"), an interactive tool that incorporates both prior work in displaying topic models and some novel ideas that greatly enhance the visualization of these models. The Topical Guide is a general tool for browsing the entire output of a topic model along with the analyzed corpus. With expert interaction, the Topical Guide together with the underlying topic models can provide valuable insights into a given corpus.

* We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA.
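
Filtering low-frequency words, mentioned above as a common mitigation, can be sketched as follows. The threshold `min_count`, the function name, and the toy documents are assumptions for illustration, not settings taken from the paper.

```python
from collections import Counter

def filter_rare_words(docs, min_count=2):
    """Drop word types that occur fewer than `min_count` times in the corpus."""
    freq = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if freq[tok] >= min_count] for doc in docs]

docs = [
    ["the", "qnick", "brown", "fox"],   # "qnick" stands in for an OCR error
    ["the", "quick", "brown", "dog"],
    ["the", "quick", "fox"],
]
filtered = filter_rare_words(docs)
print(filtered)
```

A misrecognized token usually appears only once or twice, so it falls below the threshold, but so does any genuinely rare word ("dog" in this toy corpus).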

* We use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs Sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human-annotated document classes when compared to the existing base clustering algorithms, with marked improvement in two out of three data sets.
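
A single two-way split of the kind a bisecting method applies repeatedly might look like the sketch below, which runs EM on a two-component mixture of multinomials. The initialization from the first and last document, the smoothing constants, and the iteration count are assumptions made for this example; the abstract does not specify them, nor how the cluster to split next is chosen.

```python
import math

def em_two_way_split(docs, vocab_size, iters=20, smoothing=0.1):
    """Split word-count vectors into two clusters with EM.

    `docs` is a list of count lists, each of length `vocab_size`.
    Returns a hard cluster index (0 or 1) per document.
    """
    def normalize(counts):
        total = sum(counts) + smoothing * vocab_size
        return [(c + smoothing) / total for c in counts]

    # Assumption: seed the two components from the first and last document.
    word_probs = [normalize(docs[0]), normalize(docs[-1])]
    mix = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each document.
        resp = []
        for doc in docs:
            logp = [
                math.log(mix[k])
                + sum(c * math.log(word_probs[k][w]) for w, c in enumerate(doc))
                for k in range(2)
            ]
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            norm = sum(weights)
            resp.append([wt / norm for wt in weights])
        # M-step: re-estimate mixing weights and word distributions.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mix[k] = (nk + 1.0) / (len(docs) + 2.0)  # add-one smoothing keeps mix[k] > 0
            expected = [
                sum(r[k] * doc[w] for r, doc in zip(resp, docs))
                for w in range(vocab_size)
            ]
            word_probs[k] = normalize(expected)
    return [0 if r[0] >= r[1] else 1 for r in resp]

# Four toy documents over a four-word vocabulary (hypothetical counts).
docs = [[3, 2, 0, 0], [4, 1, 0, 0], [0, 0, 3, 2], [0, 1, 2, 3]]
assignments = em_two_way_split(docs, vocab_size=4)
print(assignments)
```

A bisecting procedure would call such a split on one cluster at a time until the desired number of clusters is reached.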

* In Proceedings of the Conference on Knowledge Discovery and Data Mining (KDD 2008)

* We explore the convergence rate, the possibility of label switching, and chain summarization methodologies for document clustering on a mixture of multinomials model using Gibbs sampling, and show that fairly simple methods can be employed, while still producing clusterings of superior quality compared to those produced with the EM algorithm. We shed further light on the effective use of Gibbs sampling for document clustering.
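
Label switching means that two samples can describe the same partition under different cluster names. One generic way to compare samples regardless of naming is to look only at whether each pair of documents is placed together, as in the Rand index sketched here. This is an illustration of the issue, not the chain summarization method used in the paper; the sample assignments are made up.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two clusterings agree.

    A pair counts as agreement when both clusterings put the two documents
    in the same cluster, or both put them in different clusters.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

sample_1 = [0, 0, 1, 1, 2]
sample_2 = [2, 2, 0, 0, 1]   # same partition as sample_1, labels permuted
sample_3 = [0, 1, 1, 1, 2]   # a genuinely different partition

print(rand_index(sample_1, sample_2))
print(rand_index(sample_1, sample_3))
```

Comparing the raw label vectors of `sample_1` and `sample_2` element by element would report total disagreement, while the pairwise view shows they are identical clusterings.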

* In Proceedings of the Second IEEE International Conference on Semantic Computing (ICSC 2008)

* We consider a sentiment regression problem: summarizing the overall sentiment of a review with a real-valued score. Empirical results on a set of labeled reviews show that real-valued sentiment modeling is feasible, and several algorithms improve upon baseline performance. We also analyze performance as the granularity of the classification problem moves from two-class (positive vs. negative) towards infinite-class (real-valued).
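
The move from two-class to finer-grained targets can be pictured as binning a real-valued score into progressively more classes. The sketch below assumes scores normalized to the range 0 to 1 and equal-width bins; both are assumptions for illustration, and `to_classes` is a hypothetical helper.

```python
def to_classes(score, n_classes, low=0.0, high=1.0):
    """Map a real-valued score in [low, high] to one of `n_classes` equal-width bins."""
    if not low <= score <= high:
        raise ValueError("score outside [low, high]")
    width = (high - low) / n_classes
    # The top edge of the range belongs to the last bin.
    return min(int((score - low) / width), n_classes - 1)

scores = [0.05, 0.45, 0.55, 0.95]
two_class = [to_classes(s, 2) for s in scores]    # negative (0) vs. positive (1)
five_class = [to_classes(s, 5) for s in scores]   # e.g. a five-point scale
print(two_class, five_class)
```

As `n_classes` grows, the bins shrink and the classification target approaches the real-valued score itself.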
