A collection of sloppy snippets for scientific computing and data visualization in Python.

Wednesday, September 24, 2014

Text summarization with NLTK

The goal of automatic text summarization is to reduce a textual document to a summary that retains the pivotal points of the original document. Research on text summarization is very active and in recent years many summarization algorithms have been proposed. In this post we will see how to implement a simple text summarizer using the NLTK library (which we also used in a previous post) and how to apply it to some articles extracted from the BBC news feed. The algorithm we are going to see tries to extract one or more sentences that cover the main topics of the original document, based on the idea that if a sentence contains the most recurrent words in the text, it probably covers most of the topics of the text.
Here's the Python class that implements the algorithm:

The FrequencySummarizer tokenizes the input into sentences and computes a term frequency map of the words. The frequency map is then filtered to ignore both very rare and very frequent words: this way it discards noisy words such as determiners, which are very frequent but carry little information, as well as words that occur only a few times. Finally, the sentences are ranked according to the frequency of the words they contain and the top sentences are selected for the final summary.

To test the summarizer, let's create a function that extracts the natural language text from an HTML page using BeautifulSoup:
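A possible implementation for Python 3 (the name get_only_text and the choice of the built-in html.parser backend are this sketch's own):

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_only_text(url):
    """Return the title and the paragraph text of the page at the given url."""
    page = urlopen(url).read().decode('utf8')
    soup = BeautifulSoup(page, 'html.parser')
    # concatenate the text of all the paragraph tags of the page
    text = ' '.join(p.get_text() for p in soup.find_all('p'))
    return soup.title.get_text(), text
```

The output below was obtained by fetching the article URLs listed in the guid tags of the BBC RSS feed, then printing the title of each article followed by the two top-ranked sentences.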

----------------------------------
BBC News - Scottish independence: Campaigns seize on Scotland powers pledge
* Speaking ahead of a visit to apprentices at an engineering firm in
Renfrew, Deputy First Minister Nicola Sturgeon said: Only a 'Yes' vote will
ensure we have full powers over job creation - enabling us to create more
and better jobs across the country.
* Asked if the move smacks of panic, Mr Alexander told BBC Breakfast:
I don't think there's any embarrassment about placing policies on the
front page of papers with just two days to go.
----------------------------------
BBC News - US air strike supports Iraqi troops under attack
* Gabriel Gatehouse reports from the front line of Peshmerga-held territory
in northern Iraq The air strike south-west of Baghdad was the first taken as
part of our expanded efforts beyond protecting our own people and humanitarian
missions to hit Isil targets as Iraqi forces go on offence, as outlined in the
president's speech last Wednesday, US Central Command said.
* But Iran's Supreme Leader Ayatollah Ali Khamenei said on Monday that the US
had requested Iran's co-operation via the US ambassador to Iraq.
----------------------------------
BBC News - Passport delay victims deserve refund, say MPs
* British adult passport costs Normal service - £72.50 Check Send -
Post Office staff check application correct and it is sent by Special Delivery
- £81.25 Fast-Track - Applicant attends Passport Office in person and passport
delivered within one week - £103 Premium - Passport available for collection
on same day applicant attends Passport Office - £128 In mid-June it announced
that - for people who could prove they were booked to travel within seven days
and had submitted passport applications more than three weeks earlier - there
would be a free upgrade to its fast-track service.
* The Passport Office has since cut the number of outstanding applications to
around 90,000, but the report said: A number of people have ended up
out-of-pocket due to HMPO's inability to meet its service standard.
----------------------------------
BBC News - UK inflation rate falls to 1.5%
* Howard Archer, chief UK and European economist at IHS Global Insight,
said: August's muted consumer price inflation is welcome news for consumers'
purchasing power as they currently continue to be hampered by very
low earnings growth.
* Consumer Price Index (CPI) inflation fell to 1.5% from 1.6% in August,
the Office for National Statistics said.
----------------------------------
BBC News - Thailand deaths: Police have 'number of suspects'
* The BBC's Jonathan Head, on Koh Tao, says police are focussing on the
island's Burmese community BBC south-east Asia correspondent Jonathan Head
said the police's focus on Burmese migrants would be quite controversial as
Burmese people were often scapegoated for crimes in Thailand.
* By Jonathan Head, BBC south-east Asia correspondent The shocking death of
the two young tourists has cast a pall over this scenic island resort Locals
say they can remember nothing like it happening before.

Of course, evaluating a text summarizer is not an easy task. But from the results above we note that the summarizer often picked quoted text reported in the original article, and that the sentences it picked often represent decent insights if we consider the title of the article.

80 comments:

Hello, this is an amazing program. Can you tell me which algorithm it is based on, or is it your own? Why did you choose min_cut=0.1 and max_cut=0.9? I mean to say, why can't 0.2 be the lower cut-off to declare that a word is not required?

Are you aware of any summarisation algorithm for online reviews? It would be different from news summaries as one needs to first identify the product features mentioned, then the sentiment associated with that feature.

You're looking for a cross-document summary. You could use this algorithm, feeding it all the reviews as input. Of course, it is very naive, but the high frequency terms should be given by the product features mentioned by the reviewers.

Oh okay thanks a lot I'll definitely try playing around with the code. And also I'd love it if you direct me to some papers or books that helped you write the code. I'm just a beginner and I want to explore more.


I think this post is really worth a try. I'd like to thank you for sharing this information. One of these days I will use your tips and see if I come up with a much better result, if not close to what you have been doing right now.

Thanks - this code is great. Only one strange thing for me - if I try the code as it is, I get an error:

in _compute_frequencies
    for w in freq.keys():
RuntimeError: dictionary changed size during iteration

I'm new to Python, but I understand the error - the code is trying to modify the thing that controls the iteration during the loop. Why don't others have this problem? If I take out that 'del freq[w]' it works, but obviously without the 'cut' functionality. Any ideas? I'm using Python 3. Thanks again for the code, though.
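In Python 3, dict.keys() returns a live view of the dictionary, so deleting entries while looping over it raises exactly this error; the usual fix is to iterate over a snapshot of the keys. A minimal sketch of the pattern, with illustrative frequencies and the 0.1/0.9 cut-offs from the post:

```python
from collections import defaultdict

# toy term frequency map: one very frequent word, one rare word
freq = defaultdict(int, {'the': 10, 'cat': 3, 'sat': 1})

# dict.keys() is a live view in Python 3, so deleting inside the loop
# would raise "dictionary changed size during iteration";
# list(freq.keys()) takes a snapshot, making the deletion safe
max_freq = float(max(freq.values()))
for w in list(freq.keys()):
    freq[w] = freq[w] / max_freq
    if freq[w] >= 0.9 or freq[w] <= 0.1:
        del freq[w]

print(dict(freq))  # the very frequent and very rare words are gone
```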

Thank you. So the core algorithm is plain term frequency, not TF-IDF? Sorry, I'm rather confused. I have heard of algorithms like sentence clustering, lexical chains and TF-IDF. Is the algorithm you made neither of them?

This is an amazing program, but is it really important to tune the parameters? You chose min_cut=0.1 and max_cut=0.9. If the parameter filtering were not used in the program, would it be wrong, or would it cause a problem for the summarization?

Hello again Friday, these parameters simply filter out words with too high or too low frequencies. If you think they're not a problem for your input data, you can simply set min_cut to 0 and max_cut to 1 to disable the filtering. It really depends on the language you're dealing with and on whether you applied other types of normalization before using this algorithm.

I am using your program to summarize a set of disaster responses. It is repeating the same sentence multiple times, and when I ask for a 2-line summary it gives me more than two lines. What is the reason behind this?

Good work. The same algorithm is used by many online summary tools, and I found the results similar. But there are better algorithms available. Full marks to you for implementing this one and sharing it with people.

Can I get a more detailed explanation as to how this can be applied to other webpages that are not RSS feeds? What is 'guid'? When I try find_all('p'), it seems to be reading just the first paragraph on that page.

Hi Flavour, guid is a tag used in the XML code of the feed which contains the URL of the article. When you run find_all('p'), make sure you're iterating over the results of the function, as it should return all the paragraphs in the HTML you're working with. It's hard to advise on what to do if your data is not in RSS, as the way you need to retrieve it strongly depends on the format of the input data.
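For example, a quick check (on a made-up snippet of HTML) that find_all('p') really returns every paragraph and needs to be iterated over:

```python
from bs4 import BeautifulSoup

html = '<html><body><p>one</p><p>two</p><p>three</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find_all('p') returns ALL the paragraph tags;
# soup.find('p') (or soup.p) would return only the first one
paragraphs = [p.get_text() for p in soup.find_all('p')]
print(paragraphs)  # every paragraph, in document order
```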

It seems that many of you get the urllib2 error, generally under Python 3. The simple solution I found is:

import urllib.request
req = urllib.request.Request(url)
page = urllib.request.urlopen(req).read().decode('utf8')
soup = BeautifulSoup(page, "html5lib")

The other thing is to pass "html5lib" to BeautifulSoup to remove the parser error. The rest seems correct.