VisitedLinks: An auxiliary service for crawler workers

The VisitedLinks service helps crawlers track which links in a data source have already been visited. This is necessary for sources whose link graph is not a simple tree (as is usually the case in a filesystem) but can contain meshes or even cycles. For example, when crawling a web site, some pages are linked from many (or even all) other pages, and pages can refer to each other. Such pages must not be crawled over and over again, because this would produce duplicates or endless loops. The VisitedLinks service can keep track of this information even when the crawler worker is running on multiple nodes in a SMILA cluster.

Usage is relatively simple: the crawler just needs to call a single method, isVisited.

The method returns false if the link has not yet been visited in this job run, or if the link was visited in this job run while processing a link bulk with the same ID. Usually an input bulk is processed twice only if a first try failed for some reason (e.g. the process or machine crashed). So if a worker processes the same input bulk again, it is fairly certain that the first attempt failed and the link was not actually crawled.

In this case the service updates the entry for the link and the crawler should continue to crawl the link. However, because checking and updating the entry in the service may not be completely atomic, the crawler should check again a bit later (before actually writing records to output bulks) whether the link has still not been visited by another task. To do so, just repeat the same isVisited call as before.

Otherwise, the method returns true, i.e. the link was visited in the same job run, but read from a different input bulk. In this case the crawler should just drop the link.
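
The check, crawl, and re-check sequence described above can be sketched as follows. This is a minimal, self-contained illustration and not the SMILA API: the interface VisitedLinksSketch, the parameter list of its isVisited method (source ID, URL, job run ID, input bulk ID), and the single-node in-memory implementation are assumptions made for this example. The actual signature should be taken from the VisitedLinks service interface.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VisitedLinksExample {

  /** Hypothetical stand-in for the VisitedLinks service; the real signature may differ. */
  interface VisitedLinksSketch {
    boolean isVisited(String sourceId, String url, String jobRunId, String inputBulkId);
  }

  /** Single-node, in-memory stand-in that mimics the semantics described in the text. */
  static class InMemoryVisitedLinks implements VisitedLinksSketch {
    // Key: source + job run + URL. Value: ID of the input bulk that marked the link.
    private final Map<String, String> entries = new ConcurrentHashMap<>();

    @Override
    public boolean isVisited(String sourceId, String url, String jobRunId, String inputBulkId) {
      String key = sourceId + '\n' + jobRunId + '\n' + url;
      // putIfAbsent returns null if no entry existed, otherwise the stored bulk ID.
      String previousBulk = entries.putIfAbsent(key, inputBulkId);
      // The link counts as visited only if a different input bulk marked it before.
      return previousBulk != null && !previousBulk.equals(inputBulkId);
    }
  }

  /** Returns true if the link was crawled and its records may be written. */
  static boolean crawlLink(VisitedLinksSketch visitedLinks, String sourceId, String url,
      String jobRunId, String inputBulkId) {
    if (visitedLinks.isVisited(sourceId, url, jobRunId, inputBulkId)) {
      return false; // marked by a different input bulk: drop the link
    }
    // ... fetch and process the link here ...
    // Second check just before writing records to the output bulk.
    return !visitedLinks.isVisited(sourceId, url, jobRunId, inputBulkId);
  }

  public static void main(String[] args) {
    VisitedLinksSketch visitedLinks = new InMemoryVisitedLinks();
    String url = "http://example.org/";
    System.out.println(crawlLink(visitedLinks, "web", url, "run-1", "bulk-A")); // true
    System.out.println(crawlLink(visitedLinks, "web", url, "run-1", "bulk-B")); // false
    System.out.println(crawlLink(visitedLinks, "web", url, "run-1", "bulk-A")); // true (retried bulk)
  }
}
```

The third call models a retried input bulk: because the stored entry carries the same bulk ID, the link is treated as not yet crawled and is processed again.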

See the WebCrawlerWorker for an example of how to use this service.

ObjectStoreVisitedLinks service implementation

The bundle org.eclipse.smila.importing.state.objectstore provides an implementation of the VisitedLinks service. It uses the ObjectStore service, in a similar way to the ObjectStoreDeltaService, to keep track of the visited state of links.

The service uses the store visitedlinks.

Configuration

As the ObjectStoreVisitedLinks service shares most of its code with the ObjectStoreDeltaService, it also has the same configuration properties as the delta service. The only difference is that they are read from org.eclipse.smila.importing.state.objectstore/visitedlinksstore.properties.

VisitedLinks REST API

Currently there is only a simple REST API for VisitedLinks. It shows how many entries have been stored for each data source, and it allows deleting all entries of a single source or all entries of all sources.

Clear all sources

Get info about sources

URL: /smila/importing/visitedlinks/<sourcename>

Method: GET

Response Code:

200 OK, if successful,

404 NOT FOUND, if the source currently has no entries.

Response JSON:

Contains the ID of the source and the number of entries. If there are more than 10000 entries, the number is only estimated because exact counting could take a long time. To force an exact count, add ?countExact=true to the request URL.
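
As an illustration of the request described above, the following sketch builds the URL for a single source, optionally with ?countExact=true, and sends a GET request using the JDK HTTP client (Java 11 or later). The base URL http://localhost:8080, the source name web, and the class and method names are assumptions made for this example. Only the path and the countExact parameter come from the description above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class VisitedLinksInfoRequest {

  /** Builds the request URI for the entry count of one source. */
  static URI sourceInfoUri(String baseUrl, String sourceName, boolean countExact) {
    // Encode the source name so that it is safe as a single path segment.
    String segment = URLEncoder.encode(sourceName, StandardCharsets.UTF_8).replace("+", "%20");
    String uri = baseUrl + "/smila/importing/visitedlinks/" + segment;
    return URI.create(countExact ? uri + "?countExact=true" : uri);
  }

  public static void main(String[] args) throws InterruptedException {
    // Assumed base URL and source name; adjust both to the actual SMILA instance.
    URI uri = sourceInfoUri("http://localhost:8080", "web", true);
    HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
    try {
      HttpResponse<String> response =
          HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
      if (response.statusCode() == 404) {
        System.out.println("No entries are currently stored for this source.");
      } else {
        // On 200 OK the body is the JSON with the source ID and the entry count.
        System.out.println(response.statusCode() + ": " + response.body());
      }
    } catch (IOException e) {
      System.out.println("Could not reach the server at " + uri + ": " + e.getMessage());
    }
  }
}
```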