Secrets of Contextual Analysis

I'm analyzing the content of some documents to find potential correlations between them. Breaking each document into individual words, stemming those words, and throwing out the stopwords gave me some 18,000 unique words from a 600-document corpus; over 40% of the words appear only once in the corpus, and almost 80% appear fewer than ten times.
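Those corpus statistics are easy to reproduce by counting stem frequencies and checking the hapax rate. Though the discussion here is Perl-centric, this is a minimal stdlib-Python sketch; the toy stop list and tokenizer stand in for whatever stemming pipeline is actually in use:

```python
from collections import Counter
import re

# Toy stop list; the real pipeline has a much larger one, plus stemming.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is"}

def tokenize(text):
    # Lowercase, split on runs of letters, drop stopwords.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def corpus_stats(documents):
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    total = len(counts)                                  # unique words
    hapax = sum(1 for c in counts.values() if c == 1)    # appear exactly once
    rare = sum(1 for c in counts.values() if c < 10)     # appear < 10 times
    return total, hapax / total, rare / total

docs = ["the pizza dough rises in the oven",
        "linguistic analysis of corpus documents",
        "pizza ovens and dough"]
unique, hapax_rate, rare_rate = corpus_stats(docs)
```

On a real 600-document corpus the same three numbers fall out: vocabulary size, the fraction of singletons, and the fraction of words seen fewer than ten times.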

I knew my existing list of stop words was insufficient, but I really don't want to hand-pick the top 1,000 or 2,000 useful words from a list of 18,000, especially because this is a test corpus of perhaps 7% of the actual corpus.

Now I'm starting to wonder whether some of the lexical analysis modules would be useful for picking out only the nouns (unstemmed) and verbs (stemmed) from a document, rather than treating every word in a document as significant. The correlation algorithm appears sound, but if I can throw out a lot of irrelevant data, I can improve both the performance and the utility of the application.

What sort of correlations are you looking for? I was focusing on detecting plagiarism [perlmonks.org] at one point and found that breaking things down by sentence was more useful. As I don't know what you're trying to do, I've no idea if that link will prove useful.

Detecting plagiarism is much more specific than this problem. I want to be able to analyze a document and suggest a handful of other documents that, from their intertextual context at least, appear to discuss similar things. For example, a tutorial about creating homemade pizza dough is probably not very similar to a journal entry about linguistic analysis, but probably is similar to an article discussing different types of pizza ovens.

I see. That makes sense. Perhaps a heuristic approach is best as there are few algorithms likely to realize that "July heat wave" and "dog days of summer" might be related, though when the text is long enough idiomatic expressions are likely to come out in the wash.

My initial thought would be to try to score words in documents: take the words that appear most frequently in a document and weight that frequency by the word's infrequency in the language at large. Thus, the least common words that appear most often in a document would score the highest.
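That scoring idea is essentially TF-IDF: term frequency discounted by how many documents a word occurs in. A hedged stdlib-Python sketch of it, with illustrative names rather than anything from the poster's code:

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus_token_lists):
    """Score each word in doc_tokens by its frequency there,
    discounted by how common it is across the corpus."""
    n_docs = len(corpus_token_lists)
    df = Counter()                       # document frequency per word
    for tokens in corpus_token_lists:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    return {w: (count / len(doc_tokens)) * math.log(n_docs / df[w])
            for w, count in tf.items()}

corpus = [["pizza", "dough", "oven"],
          ["linguistic", "analysis", "corpus"],
          ["pizza", "oven", "temperature"]]
scores = tf_idf(corpus[0], corpus)
```

In the toy corpus, "dough" appears in only one document, so it outscores "pizza" and "oven", which both appear in two; that matches the intuition that rare-but-present words carry the most signal.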

You could use Ted Pedersen's WordNet::Similarity modules. They attach a numerical value to any pair of words and can help you identify which words are related, and how closely. I prefer jcn (the Jiang-Conrath measure) myself, but there are ten different techniques on offer.

Also, would it not make sense to run a POS (part-of-speech) tagger before you strip out stop words and so on? I can't recommend a Perl-based POS tagger offhand, since most of my work in this area is done in Java... but I'm pretty sure the

There are many solutions in information retrieval. You can compute a cosine measure between an interesting document and each document in the corpus.
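The cosine measure treats each document as a term-count vector and compares the angle between them. A small stdlib-Python sketch (a real system would weight by TF-IDF rather than raw counts):

```python
import math
from collections import Counter

def cosine(doc_a, doc_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(c * c for c in a.values())) *
            math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0

pizza = ["pizza", "dough", "oven", "dough"]
ovens = ["pizza", "oven", "temperature"]
journal = ["linguistic", "analysis", "stemming"]
```

Here the pizza-dough and pizza-oven documents share vocabulary and score well above zero, while the linguistics journal entry, sharing no terms, scores exactly zero, which is the behavior the original post is after.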
You can also train a Bayesian network to categorize your documents.
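As a sketch of the Bayesian route, here is a naive Bayes classifier in stdlib Python rather than a full Bayesian network, but it is the same spirit: pick the category that makes the document's words most probable. Labels and data are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (label, token_list). Returns, per class,
    a log prior, Laplace-smoothed per-word log likelihoods, and an
    unseen-word fallback."""
    class_tokens = defaultdict(list)
    for label, tokens in labeled_docs:
        class_tokens[label].extend(tokens)
    vocab = {w for _, tokens in labeled_docs for w in tokens}
    n = len(labeled_docs)
    priors = Counter(label for label, _ in labeled_docs)
    model = {}
    for label, tokens in class_tokens.items():
        counts = Counter(tokens)
        total = len(tokens) + len(vocab)   # Laplace denominator
        model[label] = (math.log(priors[label] / n),
                        {w: math.log((counts[w] + 1) / total) for w in vocab},
                        math.log(1 / total))
    return model

def classify(model, tokens):
    def score(entry):
        log_prior, likelihood, fallback = entry
        return log_prior + sum(likelihood.get(w, fallback) for w in tokens)
    return max(model, key=lambda label: score(model[label]))

training = [("cooking", ["pizza", "dough", "oven"]),
            ("cooking", ["oven", "temperature", "pizza"]),
            ("language", ["linguistic", "analysis", "stemming"])]
model = train_nb(training)
```

A document mentioning pizza and ovens lands in "cooking"; one about stemming and analysis lands in "language". For the original problem, categories could come from a hand-labeled seed set of documents.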
Some references:
- Information Retrieval / C. J. van Rijsbergen - http://www.dcs.gla.ac.uk/Keith/Preface.html [gla.ac.uk] (and specifically the 3rd chapter: http://www.dcs.gla.ac.uk/Keith/Chapter.3/Ch.3.html [gla.ac.uk])
- Bayesian Analysis For RSS Reading / Simon Cozens, in The Perl Journal, March 2004
- Building a