Introduction

I’m currently enrolled in a four-course Machine Learning Specialization offered through Coursera and taught by researchers at the University of Washington. So far I’ve found the classes and assignments to be worthwhile and well-organized. While the courses teach machine learning algorithms and concepts in a way that is independent of any particular programming language or package, the assignments are ultimately implemented using the Graphlab library in Python. Graphlab seems to be fairly easy to use, and it has the advantage of being able to scale out-of-core (rather than executing entirely in memory). However, unlike the other popular machine learning library in Python, scikit-learn, Graphlab is commercial software. In an effort to stay with open-source tools, I’ve been completing the assignments using both scikit-learn and Graphlab.

In the following post, I use scikit-learn to demonstrate how feature extraction and k-nearest neighbor algorithms can be used to compare the similarity of the text in Wikipedia articles, ultimately to find articles that are closely related to one another.

Feature Extraction from Text Documents

Word Count Vectors

One way to represent a text document is to count the instances of every word in the document; articles can then be compared based on how similar their word count vectors are. Below, I use the CountVectorizer in scikit-learn to generate a vector of word counts for the Wikipedia text on Barack Obama, and then plot the ten most frequently occurring words. As you can see, the most common words tend not to be unique to Barack Obama; they are just commonly used words (the, in, and of). Therefore, they are probably not very helpful in identifying similar articles.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

#Extract text for a particular person
name = 'Barack Obama'
text = wiki_data[name]

#Define the count vectorizer that will be used to process the data
count_vectorizer = CountVectorizer()

#Apply this vectorizer to the text to get a sparse matrix of counts
count_matrix = count_vectorizer.fit_transform([text])

#Get the names of the features
features = count_vectorizer.get_feature_names()

#Create a series from the sparse matrix and plot the ten most frequent words
d = pd.Series(count_matrix.toarray().flatten(), index=features).sort_values(ascending=False)
ax = d[:10].plot(kind='bar', figsize=(10, 6), width=.8, fontsize=14, rot=45,
                 title='Barack Obama Wikipedia Article Word Counts')
ax.title.set_size(18)

TF-IDF Vectors

Term frequency-inverse document frequency (TF-IDF) is a statistic that reflects how important a word is to a specific document within a collection of documents (the corpus). A word’s tf-idf value increases with the number of times it appears in the document, but is offset by how frequently the word appears across the corpus.
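Concretely, the standard formulation multiplies a word’s count in the document by the log of its inverse document frequency:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where tf(t, d) is the number of times term t appears in document d, N is the number of documents in the corpus, and df(t) is the number of documents containing t. Note that scikit-learn’s TfidfVectorizer uses a smoothed variant of the idf term and, by default, L2-normalizes each document’s vector.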

Below, I use a TF-IDF vectorizer to determine the relative importance of the words in the Barack Obama text. Unlike the count vectorizer above, this one is fit on the entire corpus, since the idf term depends on document frequencies across all of the articles, so it takes a little longer to run. As you can see, the words with the highest TF-IDF values are now a bit more specific to Barack Obama, including words such as “act”, “law”, and “iraq”.

from sklearn.feature_extraction.text import TfidfVectorizer

#Define the TF-IDF vectorizer that will be used to process the data
tfidf_vectorizer = TfidfVectorizer()

#Apply this vectorizer to the full dataset to create normalized vectors
tfidf_matrix = tfidf_vectorizer.fit_transform(wiki_data)

#Get the names of the features
features = tfidf_vectorizer.get_feature_names()

#Get the row that contains the relevant vector
row = wiki_data.index.get_loc(name)

#Create a series from the sparse matrix and plot the ten highest TF-IDF values
d = pd.Series(tfidf_matrix.getrow(row).toarray().flatten(), index=features).sort_values(ascending=False)
ax = d[:10].plot(kind='bar', figsize=(10, 6), width=.8, fontsize=14, rot=45,
                 title='Barack Obama Wikipedia Article Word TF-IDF Values')
ax.title.set_size(20)

Using K-Nearest Neighbors to Identify Similar Articles

Now we have our TF-IDF matrix: a sparse matrix with a TF-IDF value for every word in every one of the Wikipedia articles. The NearestNeighbors estimator in scikit-learn can be fit on this sparse matrix and will determine the distance between articles based on their TF-IDF vectors. I specify that I want the 10 nearest neighbors returned for each article. By default, NearestNeighbors uses Minkowski distance with p=2, which is equivalent to Euclidean distance, but other common metrics (cosine, Manhattan) can be specified as well.
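Fitting the model is straightforward; here is a minimal sketch that reuses the tfidf_matrix from above (with sparse input, scikit-learn falls back to a brute-force search):

from sklearn.neighbors import NearestNeighbors

#Fit the nearest neighbors model on the sparse TF-IDF matrix
#(metric='minkowski' with p=2 is the default; pass metric='cosine' for cosine distance)
model = NearestNeighbors(n_neighbors=10)
model.fit(tfidf_matrix)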

Below I write a function that takes a name as its input and returns a Series of the 10 nearest neighbors along with their distances.
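A minimal sketch of such a function, reusing the fitted model and tfidf_matrix from above (the helper name get_nearest_neighbors is arbitrary):

#Return the 10 nearest neighbors (and their distances) for a given article
def get_nearest_neighbors(name):
    #Look up the TF-IDF row for the requested article
    row = wiki_data.index.get_loc(name)
    #Query the fitted model; the article itself comes back first with distance 0
    distances, indices = model.kneighbors(tfidf_matrix.getrow(row))
    #Build a Series of distances indexed by article name
    return pd.Series(distances.flatten(), index=wiki_data.index[indices.flatten()])

get_nearest_neighbors('Barack Obama')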

Conclusion

I hope this was helpful in demonstrating how a simple text feature-extraction tool can be used with a k-nearest neighbors algorithm to find similar text documents. There are a number of ways in which this model could potentially be improved, such as the inclusion of n-grams in the text vectorizer. N-grams are contiguous sequences of n words. While this model only included single words (unigrams), it could be expanded to include bigrams, trigrams, and so on (as sketched below), which would potentially provide a more complete set of features for identifying text similarity.
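For example, TfidfVectorizer accepts an ngram_range parameter; setting it to (1, 2) would build features from both unigrams and bigrams:

#Build TF-IDF features from unigrams and bigrams instead of unigrams only
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf_matrix = tfidf_vectorizer.fit_transform(wiki_data)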