
Crawling the Web with Java

Are you playing with the possibilities of Java? This article explores in detail how a Web crawler written in Java, the SearchCrawler class, and its methods work. It is excerpted from chapter six of The Art of Java, written by Herbert Schildt and James Holmes (McGraw-Hill, 2004; ISBN: 0072229713).

SearchCrawler starts off by declaring several instance variables, most of which hold references to the interface controls. First, the MAX_URLS String array declares the list of values to be displayed in the Max URLs to Crawl combo box. Second, disallowListCache is defined for caching robot disallow lists so that they don’t have to be retrieved for each URL being crawled. Next, each of the interface controls is declared for the Search, Stats, and Matches sections of the interface. After the interface controls have been declared, the crawling flag is defined for tracking whether or not crawling is underway. Finally, the logFileWriter instance variable, which is used for printing search matches to a log file, is declared.

The SearchCrawler Constructor

When the SearchCrawler is instantiated, all the interface controls are initialized inside its constructor. The constructor contains a lot of code, but most of it is straightforward. The following discussion gives an overview.

First, the application’s window title is set with a call to setTitle( ). Next, the setSize( ) call establishes the window’s width and height in pixels. After that, a window listener is added by calling addWindowListener( ), which passes a WindowAdapter object that overrides the windowClosing( ) event handler. This handler calls the actionExit( ) method when the application’s window is closed. Next, a menu bar with a File menu is added to the application’s window.

The next several lines of the constructor initialize and lay out the interface controls. Similar to other applications in this book, the layout is arranged using the GridBagLayout class and its associated GridBagConstraints class. First, the Search section of the interface is laid out, followed by the Stats section. The Search section includes all the controls for entering the search criteria and constraints. The Stats section holds all the controls for displaying the current crawling status, such as how many URLs have been crawled and how many URLs are left to crawl.

It’s important to point out three things in the Search and Stats sections. First, the Matches Log File text field control is initialized with a string containing a filename. This string is set to a file called crawler.log in the directory the application is run from, as specified by the Java system property user.dir. Second, an ActionListener is added to the Search button so that the actionSearch( ) method is called each time the button is clicked. Third, the font for each label that is used to display results is updated with a call to setFont( ). The setFont( ) call is used to turn off the bolding of the label fonts so that they are distinguished in the interface.
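The default log file name described above can be sketched as follows. This is a minimal illustration rather than the book's listing: the class and method names (DefaultLogFile, defaultLogPath) are hypothetical, and only System.getProperty( ) and File.separator are standard Java.

```java
import java.io.File;

final class DefaultLogFile {
    // Builds "<working directory>/crawler.log" using the user.dir
    // system property. The method name is an assumption for this sketch.
    static String defaultLogPath() {
        return System.getProperty("user.dir") + File.separator + "crawler.log";
    }

    public static void main(String[] args) {
        // Prints the path that would be shown in the Matches Log File field.
        System.out.println(defaultLogPath());
    }
}
```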

Following the Search and Stats sections of the interface is the Matches section that consists of the matches table, which contains the URLs containing the search string. The matches table is instantiated with a new DefaultTableModel subclass passed to its constructor. Typically a fully qualified subclass of DefaultTableModel is used to customize the data model used by a JTable; however, in this case only the isCellEditable( ) method needs to be implemented. The isCellEditable( ) method instructs the table that no cells should be editable by returning false, regardless of the row and column specified.
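A read-only table model of the kind described above might look like the following sketch. The single "URL" column and the wrapper class name are assumptions made for illustration; the DefaultTableModel constructor and isCellEditable( ) are standard Swing.

```java
import javax.swing.table.DefaultTableModel;

final class ReadOnlyTableModelSketch {
    // Creates a model with one column (assumed to be "URL") and no rows.
    // The anonymous subclass overrides only isCellEditable().
    static DefaultTableModel createModel() {
        return new DefaultTableModel(new Object[][] {}, new String[] {"URL"}) {
            @Override
            public boolean isCellEditable(int row, int column) {
                return false; // no cell is editable, whatever the row or column
            }
        };
    }
}
```

A JTable built with new JTable(ReadOnlyTableModelSketch.createModel()) would then display rows added to the model but reject any attempt to edit them.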

Once the matches table is initialized, it is added to the Matches panel. Finally, the Search panel and Matches panel are added to the interface.

The actionSearch() Method

The actionSearch( ) method is invoked each time the Search (or Stop) button is clicked. The actionSearch( ) method starts with these lines of code:

Since the Search button in the interface doubles as both the Search button and the Stop button, it’s necessary to know which of the two buttons was clicked. When crawling is underway, the crawling flag is set to true. Thus, if the crawling flag is true when the actionSearch( ) method is invoked, the Stop button was clicked. In this scenario, the crawling flag is set to false and actionSearch( ) returns so that the rest of the method is not executed.
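The Search/Stop toggle just described can be sketched like this. It is a simplified stand-in, not the original listing: the class name and the searchesStarted counter are hypothetical, and here the flag is set to true inside actionSearch( ) itself to keep the example self-contained, whereas the real application may set it elsewhere.

```java
final class SearchToggleSketch {
    private boolean crawling;     // true while a crawl is underway
    private int searchesStarted;  // hypothetical counter, used only for this demo

    // Called for every click of the combined Search/Stop button.
    void actionSearch() {
        if (crawling) {
            // The button was acting as Stop: clear the flag and skip the rest.
            crawling = false;
            return;
        }
        // The button was acting as Search: validation and the crawl would start here.
        crawling = true;
        searchesStarted++;
    }

    boolean isCrawling() { return crawling; }

    int getSearchesStarted() { return searchesStarted; }
}
```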

Next, an ArrayList variable, errorList, is initialized:

ArrayList errorList = new ArrayList();

The errorList is used to hold any error messages generated by the next several lines of code, which validate that all required search fields have been entered.

It goes without saying that the Search Crawler will not function without a URL that specifies the location at which to start crawling. The following code verifies that a starting URL has been entered and that the URL is valid:
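A start URL check along these lines could be sketched as follows. Everything here is an assumption made for illustration: the helper names, the error message wording, and the decision to accept only http:// and https:// URLs. It also uses a typed List<String> and java.net.URI, which are more recent than the raw ArrayList shown earlier.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

final class StartUrlValidationSketch {
    // Returns a URL object if the string looks like a usable HTTP(S) URL, else null.
    static URL verifyUrl(String url) {
        String lower = url.toLowerCase();
        if (!lower.startsWith("http://") && !lower.startsWith("https://")) {
            return null;
        }
        try {
            return new URI(url).toURL();
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            return null; // not parseable as an absolute URL
        }
    }

    // Collects an error message if the start URL is missing or invalid.
    static List<String> validateStartUrl(String startUrl) {
        List<String> errors = new ArrayList<>();
        String trimmed = (startUrl == null) ? "" : startUrl.trim();
        if (trimmed.length() < 1) {
            errors.add("Missing Start URL.");
        } else if (verifyUrl(trimmed) == null) {
            errors.add("Invalid Start URL.");
        }
        return errors;
    }
}
```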

Validating the maximum number of URLs to crawl is a bit more involved than the other validations in this method. This is because the Max URLs to Crawl field can either contain a positive number that indicates the maximum number of URLs to crawl or can be left blank to indicate that no maximum should be used. Initially, maxUrls is defaulted to -1 to indicate no maximum. If the user enters something into the Max URLs to Crawl field, it is checked for being a valid numeric value with a call to Integer.parseInt( ). Integer.parseInt( ) converts a String representation of an integer into an int value. If the String representation cannot be converted to an integer, a NumberFormatException is thrown and the maxUrls value is not set. Next, maxUrls is checked to see if it is less than 1. If so, an error is added to the error list.
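The logic in this paragraph can be sketched as below. The method name, the NO_MAXIMUM constant, and the error message are assumptions; only the use of Integer.parseInt( ) and the -1 default come from the text.

```java
import java.util.List;

final class MaxUrlsValidationSketch {
    static final int NO_MAXIMUM = -1; // default meaning "crawl without a limit"

    // Returns the parsed limit, or NO_MAXIMUM when the field is blank.
    // Adds a message to errors when the entry is not a positive integer.
    static int parseMaxUrls(String input, List<String> errors) {
        int maxUrls = NO_MAXIMUM;
        String trimmed = (input == null) ? "" : input.trim();
        if (trimmed.length() > 0) {
            try {
                maxUrls = Integer.parseInt(trimmed);
            } catch (NumberFormatException e) {
                // Leave maxUrls at -1 so the range check below reports the error.
            }
            if (maxUrls < 1) {
                errors.add("Invalid Max URLs value.");
            }
        }
        return maxUrls;
    }
}
```

Because a failed parse leaves maxUrls at -1, the single "less than 1" check covers both non-numeric input and numbers such as 0 or -5.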

For efficiency’s sake, a StringBuffer object (referred to by message) is used to hold the concatenated message. The error list is iterated over with a for loop, adding each message to message. Notice that each time a message is added, a check is performed to see if the message is the last in the list or not. If the message is not the last message in the list, a newline (\n) character is added so that each message will be displayed on its own line in the error dialog box shown with the showError( ) method.
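The message-building loop described here might look like the following sketch. The method and class names are hypothetical; the StringBuffer and the newline-between-messages rule follow the description above. In current Java, StringBuilder or String.join( ) would be the more usual choice.

```java
import java.util.List;

final class ErrorMessageSketch {
    // Joins the error messages, one per line, with no trailing newline.
    static String joinErrors(List<String> errorList) {
        StringBuffer message = new StringBuffer();
        for (int i = 0; i < errorList.size(); i++) {
            message.append(errorList.get(i));
            // Add a newline only if another message follows this one.
            if (i + 1 < errorList.size()) {
                message.append("\n");
            }
        }
        return message.toString();
    }
}
```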

Finally, after all the field validations are successful, actionSearch( ) concludes by removing “www” from the starting URL and then calling the search( ) method:
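Removing "www" from a URL could be done along the following lines. This is a sketch under the assumption that only a leading "www." directly after the scheme separator is stripped; the helper name is illustrative and the call to search( ) is omitted.

```java
final class WwwRemovalSketch {
    // Turns "http://www.example.com" into "http://example.com".
    // URLs without "://www." are returned unchanged.
    static String removeWwwFromUrl(String url) {
        int index = url.indexOf("://www.");
        if (index != -1) {
            // Keep everything up to and including "://", then skip "www.".
            return url.substring(0, index + 3) + url.substring(index + 7);
        }
        return url;
    }
}
```

Normalizing the start URL this way lets a crawler treat the "www" and non-"www" forms of a host name as the same site when comparing links.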