Most text mining tasks, including clustering and topic detection, are based on
statistical methods that treat text as bags of words. Semantics in the text is
largely ignored in the mining process, and mining results often have low
interpretability. One particular challenge for such approaches is short text
understanding, since short texts do not contain sufficient content from which
statistical conclusions can be drawn reliably. In this paper, we improve text understanding by
using a probabilistic knowledgebase that is as rich as our mental world in terms
of the concepts (of worldly facts) it contains. We then develop a Bayesian
inference mechanism to conceptualize words and short text. We conducted
comprehensive experiments on conceptualizing textual terms, and clustering short
pieces of text such as Twitter messages. Compared to purely statistical methods
such as latent semantic topic modeling or methods that use existing
knowledgebases (e.g., WordNet, Freebase, and Wikipedia), our approach brings
significant improvements in short text understanding as reflected by the
clustering accuracy.
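The Bayesian conceptualization mechanism mentioned above can be sketched as follows. This is a minimal illustration only: the knowledgebase entries, concept priors, and probability values below are invented toy numbers, and the paper's actual inference over its knowledgebase is far richer. The sketch assumes a naive-Bayes model in which observed terms are conditionally independent given a concept.

```python
# Toy probabilistic knowledgebase (hypothetical values, for illustration
# only): P(term | concept), as might be estimated from co-occurrence counts.
KB = {
    "fruit":   {"apple": 0.6, "banana": 0.4},
    "company": {"apple": 0.7, "microsoft": 0.3},
}

# Concept priors P(concept); also illustrative assumptions.
PRIOR = {"fruit": 0.5, "company": 0.5}

def conceptualize(terms):
    """Rank concepts for a set of observed terms via a naive-Bayes
    posterior: P(concept | terms) ∝ P(concept) * Π_t P(t | concept)."""
    scores = {}
    for concept, likelihood in KB.items():
        p = PRIOR[concept]
        for t in terms:
            # Small smoothing constant for terms the concept never covers.
            p *= likelihood.get(t, 1e-6)
        scores[concept] = p
    z = sum(scores.values())
    return {c: p / z for c, p in scores.items()}  # normalized posterior
```

For example, `conceptualize(["apple", "banana"])` assigns nearly all posterior mass to "fruit", because "banana" is essentially unexplained by the "company" concept; adding context terms is what lets the same mechanism disambiguate a short text.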