3
Introduction  The major problem of this approach is the high dimensionality of the feature space.  The feature space is consists of the unique terms that occur in documents which can be in tens or hundreds of thousands of terms.  This is prohibitively high for many learning algorithms.

4
Introduction  High dimensionality of feature space is a challenge for clustering algorithms because of the inherent data sparseness.  Concept of proximity or clustering may not be meaningful in high dimensional feature space.  The solution is to reduce the feature space dimensionality.

5
Feature Selection  Feature selection methods include the removal of non-informative terms.  The focus of this presentation is the evaluation and comparison of feature selection methods in the reduction of a high dimensional feature space in text clustering problems.

6
Feature Selection  What are the strengths and weakness of existing feature selection methods applied to text clustering?  To what extend can feature selection improve the accuracy of a classifier?  How much of the document vocabulary can be reduced without losing useful information in category prediction?

8
Information Gain (IG)  Information gain is frequently employed as a term-goodness criterion in the field of machine learning.  It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document.

10
Information Gain (IG)  Given a training corpus, for each unique term, information gain is computed, and removed from the feature space those terms whose information gain was less than some predetermined threshold.  The computation includes the estimation of the conditional probabilities of a category given a term, and entropy computations.  The probability estimation has a time complexity of O(N) and space complexity of O(VN) where N is the number of training documents and V is the vocabulary size.

11
Χ 2 Statistics (CHI)  The Χ 2 statistic measures the lack of independence between t and c and can be compared to Χ 2 distribution with one degree freedom.  Using contingency table of a term t and a category c, where A is the number of times t and c co-occur, B is the number of time the t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs and N is the total number of documents, the term-goodness measure is

12
Χ 2 Statistics (CHI)  The Χ 2 statistics has a natural value of zero if t and c are independent.  For each category of Χ 2 statistic between each unique term in a training corpus and that category Χ 2 avg (t) = Σ P r (c i ) Χ 2 (t, c i )

13
Document Frequency (DF)  Document frequency is the number of documents in which a term occurs.  Document frequency is computed for each unique term in the training corpus and removed from the feature space those terms whose DF is less than some predetermined threshold.  Rare terms are either non-informative for category prediction, or not influential in global performance.  Observation: Low DF terms are assumed to be relatively informative and should not be removed aggressively.

14
Term Strength (TS)  Term strength is originally proposed and evaluated by Wilbur and Sirotkin for vocabulary reduction in text retrieval.  This methods estimates term importance based on how commonly a term is likely to appear in “closely- related” documents.  It uses a training set of documents to derive documents pairs whose similarity is above threshold.  Term strength is then computed based on the estimated conditional probability that a term occurs in the second half of a pair of related documents given that it occurs in the first half.

15
Entropy Based Ranking  Consider each feature F i as a random variable while f i as its value. From entropy theory, entropy is: E(F 1,…,F M ) = - Σ f1 … Σ fM p(f 1, …,f M ) log(p(f 1, …,f M ) where p(f 1, …,f M ) is the probability or density at the point f 1, …,f M.  If the probability is uniformly distributed and we are most certain about the outcome, then entropy is maximum.

16
Entropy Based Ranking  When the data has well-formed clusters, the uncertainty is low so is the entropy.  In the real-world data, there are few cases that the clusters are well-formed.  Two points belonging to the same cluster or 2 different clusters will contribute to the total entropy less that if they were uniformly separated.  Similarity S i1,i2 between two instances X i1 and X i2 is high if the 2 instance are very close and S i1,i2 is low if the 2 are far away. Entropy E i1,i2 will be low if S i1,i2 is either high or low, and E i1,i2 will be low otherwise.

17
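Entropy Based Ranking  In the entropy measure for a term t, $S_{i,j}$ is the similarity value between documents $d_i$ and $d_j$.  $S_{i,j}$ is defined as follows: $S_{i,j} = e^{-\alpha \times dist_{i,j}}$, with $\alpha = -\ln(0.5) / \overline{dist}$, where $dist_{i,j}$ is the distance between documents $d_i$ and $d_j$ after the term t is removed and $\overline{dist}$ is the average distance among the documents after the term t is removed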
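The ranking score for one term can be sketched as below. Several choices here are assumptions rather than facts from the slides: the pairwise entropy is taken to be -(S log S + (1 - S) log(1 - S)), which is low when S is near 0 or 1; distances are Euclidean; the divisor in α is the mean pairwise distance; and the sum runs over unordered document pairs.

```python
import math
from itertools import combinations

def entropy_without_term(vectors, term_index):
    """Total pairwise entropy of the documents after dropping one term.

    vectors    : list of equal-length lists of term weights
    term_index : position of the term to remove
    """
    reduced = [[w for k, w in enumerate(v) if k != term_index] for v in vectors]
    dists = [math.dist(reduced[i], reduced[j])
             for i, j in combinations(range(len(reduced)), 2)]
    mean_dist = sum(dists) / len(dists)
    if mean_dist == 0:
        return 0.0  # all documents coincide, so there is no uncertainty
    alpha = -math.log(0.5) / mean_dist
    total = 0.0
    for d in dists:
        s = math.exp(-alpha * d)
        if 0.0 < s < 1.0:  # s equal to 0 or 1 contributes zero entropy
            total -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return total

# Term 0 separates two clusters; term 1 is noise.
vectors = [[0, 0.0], [0, 1.0], [10, 0.5], [10, 0.2]]
print(entropy_without_term(vectors, 0))  # clusters destroyed
print(entropy_without_term(vectors, 1))  # clusters preserved
```

Under these assumptions, removing the cluster-defining term leaves a higher entropy than removing the noise term, so a term would rank higher the more the entropy rises when it is removed.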

18
Term Contribution  Text clustering is highly dependent on the documents similarity.  Sim(d i, d j ) = Σ f(t, d i ) x f(t, d j ) where f(t, d i ) represents the weight of term t in document d  tf * idf is also represents the weight of a term in document d where tf is term frequency and idf is the inverse document frequency

19
Term Contribution  The contribution of each term is the overall contribution to documents’ similarities and shown by the following equation: TC(t) = Σ f(t, d i ) x f(t, d j )

21
Experiments  K-Means algorithm is chosen to perform the actual clustering  Entropy and Precision measures are used to evaluate the clustering performance  10 sets of initial centroids are chosen randomly  Before performing clustering, tf * idf (with “ltc” scheme) is used to calculate the weight of each term.

22
Performance Measure  Entropy – Entropy measures the uniformity or purity of a cluster. The Entropy for all clusters is defined by the weighted sum of the entropy for all clusters where

23
Performance Measure  Precision – For each cluster, choose the class labels which shares most documents in a cluster becomes the final class label – The final precision is defined as the weighted sum of the precision for all clusters

25
Results and Analysis  Supervised Feature Selection – IG and CHI feature selection methods are performed – In general feature selection makes little progress on Reuters and 20NG – Achieves much improvement on Web directory dataset  Unsupervised Feature Selection – DF, TS, TC and En feature selection methods are performed – While 90% of terms removed, entropy is reduced by 2% and precision is increased by 1% – When more terms are removed, the performance of unsupervised methods is dropped quickly, however, the performance of supervised methods is still improved