Goose - Article Extractor

Intro

Goose was originally an article extractor written in Java that was most recently (August 2011) converted to a Scala project. Its mission is to take any news article or article-type web page and extract not only the main body of the article but also all of its metadata and the most probable image candidate.

The extraction goal is to get the purest extraction from the beginning of the article, to serve Flipboard/Pulse-type applications that need to show the first snippet of a web article along with an image.

Regarding the port from Java to Scala

Gravity has moved more towards Scala development internally, so maintaining the Java code started to become an issue

There wasn't enough contribution to warrant keeping it in Java

The packages were all namespaced under a person's name and not the company's name

Scala is more fun

Issues

It was a pretty fast Java-to-Scala port, so lots of the niceties of the Scala language aren't in the codebase yet, but those will come over the coming months as we rewrite a lot of the internal methods to be more Scala-esque.
We made sure it is still nice and operable from Java as well, so if you're using Goose from Java you should still be able to use it with a few changes to the method signatures.