WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes web pages automatically.

WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.

Crawler Workbench

The Crawler Workbench is a graphical user interface that lets you configure and control a customizable web crawler. Using the Crawler Workbench, you can:

Visualize a collection of web pages as a graph

Save pages to your local disk for offline browsing

Concatenate pages for viewing or printing as a single document

Extract all text matching a certain pattern from a collection of pages

Develop a custom crawler in Java or JavaScript that processes pages however you want
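
To make the pattern-extraction task in the list above concrete, here is a minimal sketch in plain Java. It does not use the Crawler Workbench or any WebSPHINX API: the class name PatternExtractSketch, the helper extractAll, the sample pages, and the regular expression are all assumptions made for illustration, and the pages are ordinary in-memory strings rather than documents fetched by a crawler.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: not part of WebSPHINX.
class PatternExtractSketch {

    // Hypothetical helper: returns every match of `regex` found in `pages`,
    // ordered by page and then by position within the page.
    static List<String> extractAll(List<String> pages, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        for (String page : pages) {
            Matcher matcher = pattern.matcher(page);
            while (matcher.find()) {
                matches.add(matcher.group());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-ins for pages that a crawler would have downloaded.
        List<String> pages = Arrays.asList(
            "<p>Contact: alice@example.org</p>",
            "<p>Write to bob@example.com or carol@example.net</p>");

        // A deliberately loose address-like pattern, chosen for the demo.
        for (String hit : extractAll(pages, "[\\w.]+@[\\w.]+")) {
            System.out.println(hit);
        }
    }
}
```

In a real crawl, the strings passed to extractAll would be replaced by the content of each visited page; how that content is obtained depends on the crawler and is not shown here.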

WebSPHINX class library

The WebSPHINX class library provides support for writing web crawlers in Java. The class library offers a number of features: