A collection of sloppy snippets for scientific computing and data visualization in Python.

Wednesday, February 26, 2014

Terms selection with chi-square

In Natural Language Processing, the identification the most relevant terms in a collection of documents is a common task. It can produce meaningful insights about the data and it can also be useful to improve classification performances and computational efficiency. A popular measure of relevance for terms is the χ2 statistic. To compute it we can convert the terms of our document collection and turn them into features of a vectorial model, then χ2 can be computed as follow:

Where f is a feature (a term in this case), t is a target variable that we, usually, want to predict, A is the number of times that f and t cooccur, B is the number of times that f occurs without t, C is the number of times that t occurs without f, D is the number of times neither t or f occur and N is the number of observations.

Let's see how χ2 can be used through a simple example. We load some posts from 4 different newsgroups categories using the sklearn interface:

Now, X is a document-term matrix where the element Xi,j is the frequency of the term j in the document i. Then, the features are given by the columns of X and we want to compute χ2 between the categories of interest and each feature in order to figure out what are the most relevant terms. This can be done as follows

We can observe that the terms with a high χ2 can be considered relevant for the newsgroup categories we are analyzing. For example, the terms space, nasa and launch can be considered relevant for the group sci.space. The terms god, jesus and atheism can be considered relevant for the groups alt.atheism and talk.religion.misc. And, the terms image, graphics and jpeg can be considered relevant in the category comp.graphics.

I'm not sure I understand your question. However, I think that you just want to compute the occurrence of each term in the sentences that belong to a given class. This way you can have a insight about how important some terms for a specific class.