Monday, 12 May 2014

deduplication and scrapy

I read several articles on near-deduplication and had an idea based on some of the algorithms previously used. Outline:

For each article, hash each sentence, and maintain table of hashed_sentences:articles_containing_sentence[]

Then, duplicates and near duplicates can efficiently be discovered and avoided with something along the lines of the following

new_article = crawl_url(url)

duplicate_possibilites = []

sentences = get_hash_sentences(new_article)

for sentence in sentences:

duplicate_possibilities += hashed_sentences[sentence]

It is then pretty straightforward to fetch the text of all existing articles which have more than some percent overlap of sentences with the new article, and to use text similarity algorithms in pairs on these articles. Alternatively, the sentence-overlap percentage could be enough to identify a new article as a 'duplicate' or not.

The sentence:article table could become undesirably large, but the size could be reduced with some heuristic selection of which sentences are 'important'. (containing at least some uncommon words, not too long or too short, etc).

I also wrote a basic IOL Spider for Scrapy, and started experimenting with using this to fetch old IOL data (ie, articles published before we started watching the RSS feeds.)