Discover Data Files on Hard Disk
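
A minimal sketch of this discovery step is shown here. The function name discover_files, the directory argument, and the "*.xml" file pattern are all assumptions for illustration; adjust them to however the articles are actually stored on disk.

```python
from pathlib import Path


def discover_files(data_dir, pattern="*.xml"):
    """Return a sorted list of files below data_dir that match pattern.

    Hypothetical helper: both the directory layout and the "*.xml"
    pattern are assumptions, not taken from the original notebook.
    """
    # rglob walks the directory tree recursively; keep regular files only.
    return sorted(p for p in Path(data_dir).rglob(pattern) if p.is_file())
```

Sorting the result keeps the order of files, and therefore the order of matrix rows later on, reproducible between runs.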

Vectorize all Articles

To reduce memory use, I wrote the following method, which returns an iterator over all article bodies.
By passing this iterator to the vectorizer, we avoid loading all articles into memory at once. Even so,
I have not been able to repeat this experiment with all 65,000-odd PLoS ONE articles without running
out of memory.
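
The idea of such an iterator can be sketched as a generator, assuming each article body is stored as one plain-text file. The name iter_article_bodies and the encoding parameter are hypothetical; the original method may well parse the files differently.

```python
def iter_article_bodies(paths, encoding="utf-8"):
    """Lazily yield the text of each article, one file at a time.

    Hypothetical helper: assumes every path points to a file whose
    whole content is the article body.
    """
    for path in paths:
        # Only one file is open, and only one body is held, at any moment.
        with open(path, encoding=encoding) as handle:
            yield handle.read()


# Possible use with scikit-learn (not executed here, requires scikit-learn):
#   from sklearn.feature_extraction.text import TfidfVectorizer
#   tfidf = TfidfVectorizer().fit_transform(iter_article_bodies(paths))
```

A generator like this only avoids holding all raw texts at once. The vocabulary and the resulting matrix still have to fit in memory, which may be part of the reason the full PLoS ONE run fails.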

The vectorizer generated a matrix with 139,748 columns (one per token, i.e. presumably every unique word used
across all 1754 PLoS Biology articles) and 1754 rows (one per article).

tfidf.shape

Let us now compute all pairwise cosine distances between all 1754 vectors (articles) in matrix tfidf.
I copied and pasted most of this from a StackOverflow answer that I cannot find now - I will
add a link to the answer when I come across it again.
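
Since the StackOverflow code is not reproduced here, the following is only a sketch of the underlying computation on a small dense toy matrix, using NumPy. The function name pairwise_cosine_similarity and the toy data are assumptions. Each row is scaled to unit length, so the dot product of two rows is their cosine similarity, and the cosine distance is one minus that value.

```python
import numpy as np


def pairwise_cosine_similarity(matrix):
    """Return the square matrix of cosine similarities between all rows."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # leave all-zero rows as zeros instead of dividing by 0
    unit = matrix / norms      # every non-zero row now has length 1
    return unit @ unit.T       # entry (i, j) is the cosine of rows i and j


# Toy stand-in for tfidf: three "articles", three "tokens".
toy = np.array([[1.0, 0.0, 1.0],
                [2.0, 0.0, 2.0],
                [0.0, 3.0, 0.0]])
similarity = pairwise_cosine_similarity(toy)
distance = 1.0 - similarity
```

Rows 0 and 1 point in the same direction, so their similarity is 1 and their distance is 0, while row 2 shares no tokens with either and gets similarity 0. For a sparse tf-idf matrix whose rows are already L2-normalised (the scikit-learn default), the same result should follow from multiplying the matrix by its own transpose.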

To get the ten most similar articles, we track the top five pairwise matches, since each match is a pair of two articles.
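
Tracking the best matches can be sketched with the standard library alone. The function name top_pairs and the toy similarity values below are assumptions for illustration; the input is taken to be a square, symmetric similarity matrix indexed by article.

```python
import heapq
from itertools import combinations


def top_pairs(similarity, k=5):
    """Return the k most similar (score, i, j) pairs with i < j."""
    n = len(similarity)
    # Visit each unordered pair once and skip the diagonal (self-matches).
    candidates = ((similarity[i][j], i, j) for i, j in combinations(range(n), 2))
    # nlargest keeps only k candidates in memory while scanning all pairs.
    return heapq.nlargest(k, candidates)


# Toy similarity matrix for four articles.
toy_similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.4],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.4, 0.8, 1.0],
]
best = top_pairs(toy_similarity, k=2)
```

Here best holds the pairs (0, 1) and (2, 3) with scores 0.9 and 0.8. With 1754 articles there are about 1.5 million pairs to scan, which a plain loop like this can still handle. Note that the five best pairs need not contain ten distinct articles, because one article can appear in several pairs.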