27 March 2012

Extracting meaningful text from webpages

I was trying to extract the meaningful text from a webpage, given its URL, as part of a crawl. For example, if I visit a news site to read a particular article, I find a lot of clutter around the news text: ads, related news stories, top news stories, comments on the article, links to other websites and much more.
Let's take this article from The Times of India as an example:

The useful text in the Times of India article makes up around 30% of the total content; the remaining 70% is clutter. You may argue that you need the links to the most popular stories, related stories and so on, but there is still a lot of extra material that we really don't care about. Extracting the meaningful information from such a page is a nightmare. We can start by getting the HTML source and stripping the HTML tags from the text. Using regular expressions, let's remove all the links too. The resulting content looks like this:

83-year-old woman sues Apple for $1m - The Times of India | The Times of India | | More More ADVERTISEMENT Hardware The Times of India The Times of India Indiatimes Web (by Google) Video Photos You are here: » » » Hardware Breaking News: 83-year-old woman sues Apple for $1m The writer has posted comments on this articleANI | Mar 26, 2012, 04.42PM IST My Saved articles Read more:||||||| SHARE AND DISCUSS NEW YORK: An 83-year-old American woman has sued for 1 million dollars after she failed to see the glass door at the tech giant's office and smashed her face. Evelyn Paswall, a former Manhattan fur-company vice president, went to to return an on December 13. While approaching the store, Paswall didn't realize she was heading straight for a wall of glass. She smashed her face against it, breaking her nose, Paswall claims in her suit filed in the US Eastern District federal court. Now the Forest Hills, Queens, resident, Paswall claimed in her lawsuit that the company was negligent not elderly-proofing the store's see-through fa ade, The New York Post reports. She argues that Apple should have put marks on the glass that older people could spot before they come face-to-face with disaster. "The defendant was negligent . . . in allowing a clear, see-through glass wall and/or door to exist without proper warning," Paswall suit said. Hi ! Do you like this story? My saved articles RELATED COVERAGE Articles Blogs LATEST NEWS » ......

As you can see, the text above contains a lot of extra material that we don't want. Attempts have been made to extract the main content; here is one such article: How to Extract a Webpage’s Main Article Content
The Java program to get the above text: (Jsoup can be downloaded from here)
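
A minimal sketch of this naive tag-and-link stripping is given here using only java.util.regex, as a dependency-free stand-in for a Jsoup-based program. The class name NaiveTextExtractor, the extract method and the sample HTML are hypothetical, and regular expressions are only a rough tool for real-world HTML.

```java
import java.util.regex.Pattern;

class NaiveTextExtractor {
    // (?is): case-insensitive, and '.' also matches line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // A whole anchor element, including its visible link text.
    private static final Pattern LINK = Pattern.compile("(?is)<a\\b[^>]*>.*?</a\\s*>");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Removes scripts, styles, links and tags, then collapses whitespace. */
    static String extract(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = LINK.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Hypothetical sample page: one navigation link and one article paragraph.
        String html = "<html><head><title>Demo</title><style>p{color:red}</style></head>"
                + "<body><a href=\"/top\">Top stories</a>"
                + "<p>An 83-year-old woman has sued <a href=\"/apple\">Apple</a> for $1m.</p>"
                + "</body></html>";
        System.out.println(extract(html));
    }
}
```

On the sample input this prints "Demo An 83-year-old woman has sued for $1m.": the page title leaks into the text and the linked word "Apple" disappears, which is the same kind of gap visible in the extracted text above.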

The best Java library I could find for getting the main text from a web page was boilerpipe, which can be tested here. It does a pretty good job of removing the clutter around the meaningful text. Running the Times of India article link through boilerpipe gives the following text:

Tweet
NEW YORK: An 83-year-old American woman has sued Apple for 1 million dollars after she failed to see the glass door at the tech giant's office and smashed her face.
Evelyn Paswall, a former Manhattan fur-company vice president, went to Apple's Manhasset store to return an iPhone on December 13.
While approaching the store, Paswall didn't realize she was heading straight for a wall of glass.
She smashed her face against it, breaking her nose, Paswall claims in her suit filed in the US Eastern District federal court.
Now the Forest Hills, Queens, resident, Paswall claimed in her lawsuit that the company was negligent not elderly-proofing the store's see-through fa ade, The New York Post reports.
She argues that Apple should have put marks on the glass that older people could spot before they come face-to-face with disaster.
"The defendant was negligent . . . in allowing a clear, see-through glass wall and/or door to exist without proper warning," Paswall suit said.
Hi !

The above text is very close to what we want. The boilerpipe library is based on this paper. By combining Jsoup (to get the page title) with boilerpipe (to get the page content), we can get the meaningful content from a webpage.
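
To give a rough feel for why a block-based approach does better than plain tag stripping, here is a small sketch that splits a page into blocks and keeps only those with enough words and a low share of link text. These are the kind of shallow text features (word count, link density) that boilerplate detection of this sort relies on. This is not boilerpipe's actual algorithm or API: the class ShallowFeatureFilter, its methods and the thresholds are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShallowFeatureFilter {
    // Tags treated as block boundaries (a simplifying assumption).
    private static final Pattern BLOCK_BREAK =
            Pattern.compile("(?i)</?(p|div|li|ul|ol|td|tr|table|h[1-6]|br)\\b[^>]*>");
    // Captures the visible text of each anchor element.
    private static final Pattern LINK = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    /** Counts whitespace-separated words after removing any tags. */
    static int countWords(String fragment) {
        String text = TAG.matcher(fragment).replaceAll(" ").trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    /** Keeps blocks with at least minWords words and link density at most maxLinkDensity. */
    static List<String> keepContentBlocks(String html, int minWords, double maxLinkDensity) {
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(html)) {
            int words = countWords(block);
            if (words == 0) {
                continue; // nothing visible in this block
            }
            int linkWords = 0;
            Matcher m = LINK.matcher(block);
            while (m.find()) {
                linkWords += countWords(m.group(1));
            }
            // Link density: share of the block's words that sit inside links.
            double linkDensity = (double) linkWords / words;
            if (words >= minWords && linkDensity <= maxLinkDensity) {
                kept.add(TAG.matcher(block).replaceAll(" ").replaceAll("\\s+", " ").trim());
            }
        }
        return kept;
    }
}
```

With thresholds such as keepContentBlocks(html, 5, 0.3), a navigation block made up entirely of links is dropped for its high link density, a short fragment like "Hi !" is dropped for having too few words, and full article sentences are kept. The thresholds are arbitrary and would need tuning on real pages.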