Use web service APIs along with these tools and techniques to construct your own hybrid search bots and automate your web data-collection tasks.

Finding the Correct Birth Year
When the program finishes scanning the URLs associated with a famous person, it is left with a list of potential birth years, along with a count of how many times each one occurred. It then calls the getResult function to determine which year received the largest number of "votes."

The function begins by creating two variables. The first, named result, holds the year with the largest count seen so far. The second, named maxCount, holds the number of votes for the year currently stored in result.

int result = -1;
int maxCount = 0;

Next, it creates a Set containing each candidate birth year and checks the count for each one. At the end, the result variable holds the birth year with the largest count.

If no birth years were found, then the result variable remains set to its initial value of -1, which informs the calling method that no birth year was found.
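The vote-counting logic described above can be sketched roughly as follows. This is a minimal illustration rather than the article's original listing: the class name BirthYearVotes and the Map<Integer, Integer> parameter (year to count) are assumptions made for the example, and only result, maxCount, and getResult come from the text.

```java
import java.util.HashMap;
import java.util.Map;

class BirthYearVotes {
    /**
     * Returns the year with the most votes, or -1 if no years were recorded.
     * The yearCounts map (year -> number of occurrences) is an assumed input.
     */
    static int getResult(Map<Integer, Integer> yearCounts) {
        int result = -1;
        int maxCount = 0;
        // Walk the Set of candidate years and keep the one with the highest count.
        for (int year : yearCounts.keySet()) {
            int count = yearCounts.get(year);
            if (count > maxCount) {
                result = year;
                maxCount = count;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> votes = new HashMap<>();
        // Each call records one "vote" for a year found on a page.
        votes.merge(1809, 1, Integer::sum);
        votes.merge(1865, 1, Integer::sum);
        votes.merge(1809, 1, Integer::sum);
        System.out.println(getResult(votes));           // prints 1809
        System.out.println(getResult(new HashMap<>())); // prints -1
    }
}
```

Because the comparison is strict, a tie between two years is resolved by whichever the map happens to iterate first; a real bot might prefer to report ties explicitly.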

Going on From Here
This article showed you how to create a bot that makes use of the Yahoo web services API. This bot uses the Yahoo API to find likely pages to visit. Subsequently, it uses regular Java HTTP programming to access and analyze the data contained in those pages.

Bots are usually designed to access specific data. If you need to obtain data, and that data is available on the Internet, you can probably construct a bot to obtain it. Using the Java HTTP functions, a Java program can perform any task that a regular web user would. Creating the bot is simply a matter of reproducing the correct HTTP requests in your bot and writing the appropriate code for data recognition, extraction, and analysis.
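As a rough illustration of reproducing a request with the standard Java networking classes, the sketch below downloads the body of a URL as text. The class name PageFetcher, the User-Agent string, the timeouts, and the assumption that pages are UTF-8 are all hypothetical choices for this example, not details taken from the bot described in this article.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    /** Downloads the resource at the given URL and returns it as a string. */
    static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        // Identify the bot to the server; this name is a made-up placeholder.
        conn.setRequestProperty("User-Agent", "ExampleBot/1.0");
        conn.setConnectTimeout(10000);
        conn.setReadTimeout(10000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            // Read until the server closes the stream.
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            // Assumes UTF-8; a fuller version would honor the Content-Type charset.
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```

A call such as PageFetcher.fetch(new URL("http://www.example.com/")) would return that page's HTML, ready to be passed to the stripping and analysis steps.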

Fortunately, as you have seen, much of the code (the initial search, URL gathering, HTML stripping, sentence collection, and word tokenizing) is boilerplate; you'd write the same type of code to search for any type of data. That also means it's reusable. The only part that's not reusable is the code that identifies and analyzes the specific data you're looking for. By replacing that code with your own custom code, you have all the basic tools you need to construct your own bots to search and extract data from the web.
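One way to picture that split between reusable and custom code is sketched below. The generic steps (HTML stripping, sentence collection, word tokenizing) live in one class, and the data-specific analyzer is passed in as a function. All names here, including PageTextPipeline, analyze, and birthYear, are hypothetical, and the regex-based stripping and splitting are deliberate simplifications rather than the techniques used in the article's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class PageTextPipeline {
    /** Crude tag stripper: adequate for a sketch, not a real HTML parser. */
    static String stripHtml(String html) {
        return html.replaceAll("(?s)<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Splits text into sentences at '.', '!' or '?' followed by whitespace. */
    static List<String> sentences(String text) {
        List<String> out = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) out.add(s);
        }
        return out;
    }

    /** Splits a sentence into alphanumeric word tokens. */
    static List<String> words(String sentence) {
        List<String> out = new ArrayList<>();
        for (String w : sentence.split("[^A-Za-z0-9]+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }

    /** Reusable driver: the analyzer is the only data-specific part. */
    static <T> List<T> analyze(String html, Function<List<String>, T> analyzer) {
        List<T> results = new ArrayList<>();
        for (String sentence : sentences(stripHtml(html))) {
            T found = analyzer.apply(words(sentence));
            if (found != null) results.add(found);
        }
        return results;
    }

    /** Example analyzer: the first four-digit token in a sentence containing "born". */
    static Integer birthYear(List<String> tokens) {
        if (!tokens.contains("born")) return null;
        for (String t : tokens) {
            if (t.matches("\\d{4}")) return Integer.parseInt(t);
        }
        return null;
    }
}
```

Calling PageTextPipeline.analyze(html, PageTextPipeline::birthYear) yields the candidate years for one page; searching for a different kind of data would mean swapping in a different analyzer while leaving the other methods untouched.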

Jeff Heaton is an author, college instructor, and consultant. Jeff is the author of four books and over two dozen journal and magazine articles. Jeff maintains a personal website where he publishes information about artificial intelligence, spider/bot programming, and other topics.