Crawling the Semantic Web, concluded

This article, the second of two parts, examines the problems raised by the glut of information available through the web, and how to tame it. It is excerpted from the book Wicked Cool Java, written by Brian D. Eubanks (No Starch Press, 2005; ISBN: 1593270615).

We just showed how Informa can retrieve data from an RSS channel, using the ChannelBuilder class. Ideally, updating your copy of the feed should be an automated process, and Informa can also do this. The Poller class (located in the de.nava.informa.utils.poller package) can periodically poll a Channel object's RSS feed and trigger some action whenever there are changes. By default, this polling occurs every 60 minutes but can be configured to use longer or shorter periods. The Poller class works by notifying an observer object whenever something changes in the feed. To use this process, you must first create a class implementing the PollerObserverIF interface. This interface has methods for poll tracking, error handling, and feed change notification.

Let's look at an example of a PollerObserverIF that uses the newItem method, which the Poller calls whenever the feed has a new item. However, the new item will not be added to the copy in your Channel object unless the observer explicitly adds it. Here is a PollerObserverIF implementation that adds feed changes to the Channel object and also prints a notification message to the console:
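As a rough illustration of the observer idea, here is a minimal, self-contained sketch. It does not use Informa itself: the Item, Channel, and FeedObserver types are hypothetical stand-ins invented for this example, and Informa's real PollerObserverIF declares more methods (for error handling, among others) whose exact signatures should be taken from its Javadoc. The observer reports each polling event and calls addItem itself, since nothing else will update the local copy.

```java
import java.util.ArrayList;
import java.util.List;

class ObserverSketch {
    /** Hypothetical stand-in for a feed item; not Informa's own item type. */
    static class Item {
        final String title;
        Item(String title) { this.title = title; }
    }

    /** Hypothetical stand-in for the local copy of a channel. */
    static class Channel {
        private final List<Item> items = new ArrayList<>();
        void addItem(Item item) { items.add(item); }
        int itemCount() { return items.size(); }
    }

    /** Simplified observer contract, assumed for this sketch only. */
    interface FeedObserver {
        void pollStarted(Channel channel);
        void newItem(Item item, Channel channel);
        void pollFinished(Channel channel);
    }

    /** Prints poll events and explicitly adds each new item to the channel. */
    static class PrintingObserver implements FeedObserver {
        public void pollStarted(Channel channel) {
            System.out.println("Poll started with " + channel.itemCount() + " items");
        }

        public void newItem(Item item, Channel channel) {
            System.out.println("New item: " + item.title);
            channel.addItem(item); // without this call the local copy never changes
        }

        public void pollFinished(Channel channel) {
            System.out.println("Poll finished with " + channel.itemCount() + " items");
        }
    }

    public static void main(String[] args) {
        Channel channel = new Channel();
        FeedObserver observer = new PrintingObserver();
        // Simulate the calls a poller would make during one polling cycle.
        observer.pollStarted(channel);
        observer.newItem(new Item("Hello, feed"), channel);
        observer.pollFinished(channel);
    }
}
```

Removing the addItem call from newItem turns this into a notify-only observer: the messages still print, but the channel's item count stays where it started.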

This observer will print information about the beginning and end of each polling event, list any new items in the feed, and add new items to the object model.

Warning: An observer does not add new items to the Channel object unless you explicitly call the addItem method. If you have more than one observer attached, one of them should be assigned the task of adding the new item to the Channel.

With real RSS feeds, you'll want to set a polling frequency that doesn't clog the network or the site with unnecessary traffic. A polling period of 60 minutes (the default) or longer should be frequent enough for most sites. The following code fragment uses the observer that we just defined and polls the RSS feed for a previously loaded Channel object every 60 minutes.
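To make the interval arithmetic concrete, the sketch below schedules a repeating task with the JDK's ScheduledExecutorService. This is a stand-in, not Informa's Poller, which does its own scheduling; the class and method names here (PollingSketch, minutesToMillis, startPolling) are hypothetical. The demo runs with a 50 ms period so that it finishes quickly, while the 60-minute value is only computed and printed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PollingSketch {
    /** Converts a period in minutes to milliseconds. */
    static long minutesToMillis(long minutes) {
        return TimeUnit.MINUTES.toMillis(minutes);
    }

    /** Runs pollTask immediately, then again every periodMillis milliseconds. */
    static ScheduledExecutorService startPolling(Runnable pollTask, long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(pollTask, 0, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }

    public static void main(String[] args) throws InterruptedException {
        long hourly = minutesToMillis(60);
        System.out.println("60 minutes = " + hourly + " ms");

        // Demo only: a 50 ms period instead of the hourly one.
        CountDownLatch threePolls = new CountDownLatch(3);
        ScheduledExecutorService scheduler = startPolling(threePolls::countDown, 50);
        boolean completed = threePolls.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        System.out.println("Three polls completed: " + completed);
    }
}
```

Deriving the period from a unit conversion, rather than typing a literal such as 3600000, makes it harder to pass seconds or minutes by mistake where milliseconds are expected.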

Remember that the polling interval is specified in milliseconds!

If you are going to filter items from the feed, the observers should not do the filtering. A separate component can approve polled changes before the observers are notified. This keeps the observers focused on their task of propagating changes rather than filtering data. The process also scales better that way, because you may want many observers to receive the approved changes. This filtering and approval process is described in the next section.