Story Discovery Using K-Means Clustering on News Articles

Now, perhaps more than ever, journalism has become instrumental in allowing us to understand current events. We rely on organizations to bring us the truth about the state of our country and world. However, with the rise of clickbait and fake news in this so-called “post-truth” era, it’s critical that our news diet is accurate and that we are not stuck in a content bubble, only consuming what the algorithms deem will generate the most revenue.

In this post, I attempt to present a method for aggregating and clustering news articles from different news sources. Hopefully, by finding articles covering the same story from different sources, the true nature of the event can be discovered and bias mitigated.

I was inspired to explore this idea by the great work done by Brandon Rose in his article Document Clustering with Python. Although I understand the clustering setup process, he does a much better job of explaining the individual steps, and I highly recommend you read that article if you want a more in-depth look at document clustering.

Gathering Articles from Sources

First, let’s start by gathering information about our sources. Our goal will be to extract current and past articles from each site in a reusable and efficient way. We could web scrape each site for the current stories, but this would be extremely hard to code effectively, given that the HTML structure of each news source is vastly different.

So let’s think. Is there a way to find currently published articles by a news source that would lend itself to efficient, reusable code? As it turns out, there is! We can take advantage of the fact that almost all news sources have active RSS feeds broadcasting the most recent articles from a range of topics.

To further aid us, Kurt McKee and Mark Pilgrim put together feedparser, a library specifically for extracting the information from an RSS feed. Let’s start by testing this using Reuters.

Fantastic! The RSS feeds provide a plethora of information but as we want to cluster articles by story, we will focus on gathering the links to the articles.

But how might we gather news from yesterday, last month, or even a year ago? For this we can turn to the incredible resource that is archive.org. Their site provides an endpoint that allows you to query their website archives by date, which is exactly what we need.

We also need to get the raw text of the articles. For this we can use the amazing newspaper library created by @codelucas. The library has tons of options for streamlined web scraping of articles but we will just use it to grab the raw text.

import feedparser
import newspaper
import pandas as pd

rss_urls = [
    "http://rss.cnn.com/rss/cnn_allpolitics.rss",
    "http://rss.nytimes.com/services/xml/rss/nyt/Politics.xml",
    "http://feeds.washingtonpost.com/rss/politics",
    "http://feeds.foxnews.com/foxnews/politics",
    "http://feeds.reuters.com/Reuters/PoliticsNews",
]

articles = []
for rss_url in rss_urls:
    feed = feedparser.parse(rss_url)

    # find all article links
    article_links = [entry['link'] for entry in feed['entries']]

    # get text for each article
    for link in article_links:
        article = newspaper.Article(link)
        article.download()
        article.parse()
        articles.append({
            "url": article.url,
            "title": article.title,
            "text": article.text,
        })

articles = pd.DataFrame(articles)

articles.head()

  text                                                title                                               url
0 Washington (CNN) President Donald Trump excori... Trump unloads on former top aide Bannon             http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/h...
1 The idea that a meeting between the three top ... Steve Bannon is 100% right about Russia and th...   http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/Z...
2 Story highlights Tina Smith replaces Sen. Al F... Two senators sworn into office amid #MeToo mov...   http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/l...
3 (CNN) The White House on Wednesday released a ... Read Trump's official statement about Steve Ba...   http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/Y...
4 Story highlights The book, "Fire and Fury: Ins... Bannon: 2016 Trump Tower meeting was 'treasonous'   http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/L...

For different dates, simply change the URL passed to feedparser to include the archive.org base URL and the date for which you want to query.

Article Clustering

With articles now in hand, we can start clustering. As mentioned above, check out the wonderful post by Brandon Rose on document clustering for better insight into the steps I’m taking to cluster the articles.

First, let’s start by appending contractions to the current list of stopwords.

Here, I’ve combined Rose’s two tokenizing and stemming functions into one. Providing the option to not stem the words is important given that after clustering, we will want to convert the stems back to their full words.

import re
import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def tokenize_and_stem(text, do_stem=True):
    # first tokenize by sentence, then by word, to ensure that punctuation
    # is caught as its own token
    tokens = [word.lower()
              for sent in nltk.sent_tokenize(text)
              for word in nltk.word_tokenize(sent)]

    # filter out any tokens not containing letters
    # (e.g., numeric tokens, raw punctuation)
    filtered_tokens = [token for token in tokens
                       if re.search('[a-zA-Z]', token)]

    if do_stem:
        # stem filtered tokens
        return [stemmer.stem(t) for t in filtered_tokens]
    return filtered_tokens

Using that function, we can create two vocabulary lists: one stemmed and one only tokenized.

# not super pythonic, no, not at all.
# use extend so it's a big flat list of vocab
totalvocab_stemmed = []
totalvocab_tokenized = []
for i in articles['text']:
    allwords_stemmed = tokenize_and_stem(i)
    totalvocab_stemmed.extend(allwords_stemmed)

    allwords_tokenized = tokenize_and_stem(i, False)
    totalvocab_tokenized.extend(allwords_tokenized)

From the vocabulary lists, we can make what is essentially a lookup table from stems to full words. However, since a stem can refer to multiple words, this lookup table will only return the first viable word.
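One way to sketch such a lookup, following the vocab-frame approach from Rose's post (the toy lists and the `stem_to_word` helper here are illustrative, standing in for `totalvocab_stemmed` and `totalvocab_tokenized`):

```python
import pandas as pd

# toy vocab lists standing in for totalvocab_stemmed / totalvocab_tokenized
totalvocab_stemmed = ['presid', 'presid', 'senat']
totalvocab_tokenized = ['president', 'presidential', 'senator']

# lookup table indexed by stem; duplicated stems keep multiple rows
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized},
                           index=totalvocab_stemmed)

def stem_to_word(stem):
    entry = vocab_frame.loc[stem]
    # .loc returns a DataFrame for duplicated stems, a Series for unique ones;
    # either way, take the first viable word
    if isinstance(entry, pd.DataFrame):
        return entry['words'].iloc[0]
    return entry['words']

print(stem_to_word('presid'))  # -> president
```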

In my implementation, instead of hard-coding a predetermined number of clusters, I chose to use a slightly modified version of the best method I could find for estimating the number of clusters in a dataset.
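The post doesn't spell out which estimation method was used; a common stand-in is to sweep candidate values of k and keep the one with the best silhouette score. The sketch below assumes scikit-learn, and the `texts` corpus and `estimate_k` helper are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# toy corpus standing in for the article texts
texts = [
    "trump bannon white house book",
    "bannon trump statement fire fury",
    "senate sworn office metoo franken",
    "senators sworn in amid metoo movement",
]

tfidf_matrix = TfidfVectorizer().fit_transform(texts)

def estimate_k(X, k_min=2, k_max=None):
    # try each candidate k and keep the one with the best silhouette score
    k_max = k_max or X.shape[0] - 1
    best_k, best_score = k_min, -1.0
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

print(estimate_k(tfidf_matrix))
```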

Conclusion

Looks like we generated some pretty interesting clusters! As you can see, some are better than others but, solely based on the titles, K-Means seems to have separated articles into relatively coherent clusters.

I would definitely like to explore more methods of creating better clusters:

remove outliers from clusters

a more data-specific method of determining the number of clusters

test other clustering algorithms

Finally, if anyone is interested, I have created a simple web app that updates hourly with news clusters from the current day, last week, and last month.