Abstract

The majority of text mining systems rely on bag-of-words approaches, representing
textual documents as multisets of their constituent words. Combined with term weighting
mechanisms, this simple representation yields features that many different algorithms
can take as input for a variety of applications, including document classification,
information retrieval, and sentiment analysis. Since the performance of
many mining algorithms depends directly on term weights, techniques for quantifying
term importance are central to text processing.
This thesis takes advantage of recent advances in keyword extraction mechanisms,
which go a step further and retain only the terms with the highest weights, i.e., the most important
words. More precisely, building on a recent keyword extraction technique, we
develop novel text mining algorithms for information retrieval, text segmentation and
summarization. Under standard evaluation protocols, these algorithms achieve
state-of-the-art performance. However, in contrast to many state-of-the-art approaches,
we try to make as few assumptions as possible about the data to be analyzed while
maintaining good performance in terms of both speed and accuracy. As a result, our
algorithms can handle inputs from a variety of domains and languages, and they
can also run in environments with limited resources. Additionally, in a field that tends
to be dominated by empirical approaches, we strive to rely on sound and rigorous
mathematical principles.