Project 1: Subject-Specific Web Crawler

NOTE: CHANGES MADE

I have made some changes below to the criteria that you
should use in choosing links to follow, and to the form of the
input.

Assigned: Sept. 13
Due: Oct. 4

The object of this project is to write a crawler that tries to download
pages dealing with a particular subject.

Specification

Inputs:

A URL from which to start the crawl (presumably, a web page that
deals with the subject).

A query word, or a set of query words.

A maximum number of pages to download. This should default to 50.

CHANGED: A "window" size. This is explained below.

Output: An HTML page with

Links to all the pages that have been downloaded.

A list of the 20 words that appear in the largest number of downloaded
pages.

For each downloaded page, a count of the number of in-links from
other downloaded pages.
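
As a rough illustration of the two statistics in the output, here is a minimal Java sketch. The class name CrawlReport, the method names, and the in-memory maps (page URL to page text, page URL to the set of URLs it links to) are assumptions made for this example; they are not part of the assignment or of the provided crawler. Words are counted once per page, and a page's link to itself is not counted as an in-link.

```java
import java.util.*;

class CrawlReport {
    /** Returns up to n words, ordered by how many pages contain them (ties alphabetical). */
    static List<String> topWordsByPageCount(Map<String, String> pageText, int n) {
        Map<String, Integer> pagesContaining = new HashMap<>();
        for (String text : pageText.values()) {
            // A set ensures each word is counted once per page, however often it repeats.
            Set<String> wordsOnPage = new HashSet<>();
            for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
                if (!w.isEmpty()) wordsOnPage.add(w);
            }
            for (String w : wordsOnPage) pagesContaining.merge(w, 1, Integer::sum);
        }
        List<String> words = new ArrayList<>(pagesContaining.keySet());
        words.sort((a, b) -> {
            int byCount = Integer.compare(pagesContaining.get(b), pagesContaining.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(words.subList(0, Math.min(n, words.size())));
    }

    /** Counts, for each downloaded page, the links to it from other downloaded pages. */
    static Map<String, Integer> inLinkCounts(Map<String, Set<String>> outLinks) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String page : outLinks.keySet()) counts.put(page, 0);
        for (Map.Entry<String, Set<String>> e : outLinks.entrySet()) {
            for (String target : e.getValue()) {
                // Ignore links to pages that were not downloaded, and self-links.
                if (counts.containsKey(target) && !target.equals(e.getKey())) {
                    counts.merge(target, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Calling topWordsByPageCount(pageText, 20) would give the word list for the report. You would probably want to remove stop words first, which this sketch does not do.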

CHANGED:
In choosing to download a link L from file F, your program should consider

Whether the query word (or words) appears in the URL of L;

Whether the query word appears in the anchor text of L;

Whether the query word appears within W words of L in F,
where W is the window size.

Generally, it's a good rule to follow a link labelled "Links".
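
The window criterion could be checked along the following lines. This is only a sketch under some assumptions of mine: the text of F has already been split into a list of words with the HTML tags stripped, the anchor text of L occupies positions linkStart (inclusive) to linkEnd (exclusive) in that list, and the query words are lower-case. The class and method names are made up for the example.

```java
import java.util.*;

class WindowCheck {
    /**
     * Returns true if some query word occurs within `window` words before
     * or after the anchor text, not counting the anchor text itself.
     */
    static boolean queryWithinWindow(List<String> words, int linkStart, int linkEnd,
                                     Set<String> query, int window) {
        int from = Math.max(0, linkStart - window);
        int to = Math.min(words.size(), linkEnd + window);
        for (int i = from; i < to; i++) {
            boolean insideAnchor = i >= linkStart && i < linkEnd;
            if (!insideAnchor && query.contains(words.get(i).toLowerCase())) {
                return true;
            }
        }
        return false;
    }
}
```

The anchor text is excluded here because it is covered by a separate criterion; whether to count it twice is a design choice.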

How you want to combine these is up to you. Among links that score equally
according to the above criteria, the crawler should choose the link
closest to the starting page. (For example, if no links satisfy any
of the above criteria, then the crawler should just do a breadth-first
search.)
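
One possible way to get this ordering is a priority queue keyed on score first and distance from the starting page second. The sketch below is an illustration only: the class Frontier, its methods, and especially the weights in score() are arbitrary choices of mine, not a required design. A sequence number breaks the remaining ties in insertion order, so that when every link scores 0 the crawl is breadth-first.

```java
import java.util.*;

class Frontier {
    private static final class Candidate {
        final String url;
        final int score;   // higher is better
        final int depth;   // number of links from the starting page
        final long seq;    // insertion order, used as the final tie-breaker
        Candidate(String url, int score, int depth, long seq) {
            this.url = url; this.score = score; this.depth = depth; this.seq = seq;
        }
    }

    private final PriorityQueue<Candidate> queue = new PriorityQueue<>(
        Comparator.comparingInt((Candidate c) -> -c.score)
                  .thenComparingInt(c -> c.depth)
                  .thenComparingLong(c -> c.seq));
    private final Set<String> seen = new HashSet<>();
    private long nextSeq = 0;

    /** Combines the three criteria; the weights 3, 2, 1 are arbitrary. */
    static int score(boolean inUrl, boolean inAnchor, boolean inWindow) {
        return (inAnchor ? 3 : 0) + (inUrl ? 2 : 0) + (inWindow ? 1 : 0);
    }

    /** Adds a link unless the same URL has been added before. */
    void add(String url, int score, int depth) {
        if (seen.add(url)) {
            queue.add(new Candidate(url, score, depth, nextSeq++));
        }
    }

    /** Returns the best remaining URL, or null if the frontier is empty. */
    String next() {
        Candidate c = queue.poll();
        return c == null ? null : c.url;
    }
}
```

Note that this sketch ignores a URL that is seen a second time, even if it scores higher the second time; you may want to handle that case differently.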

Extra credit (NEW)

For extra credit, you may implement any of the following features:

Implement stemming. That is, check whether the stem of the query word
matches the stem of the word in F.

Implement synonymy. That is, using an online thesaurus, check
whether the word in F is a synonym of the query word.

Use the HTML structure of F to get a judgment of whether
a word in F is connected to the link L.

After downloading a page, evaluate its relevance to the query.
Prefer to search outward from relevant pages.

Deliverables

Email to the TA Zhongshan Zhang (zhongsha@cs) and to me (davise@cs.nyu.edu):

A listing of your program.

Instructions on how to run your program.

Citation of any external resources that you have used.

A report. This should contain:

A description of the criteria your program uses to choose links.

CHANGED:
The results of a set of three experiments, stating (a) the starting URL;
(b) the subject query; (c) the precision achieved --- that is,
the fraction of pages downloaded that are relevant to the subject.
You should judge "relevance" by personal inspection. All the experiments
in this set should use the default values for the window size and
number of downloads.

CHANGED:
The results of a second set of three experiments, all using the same
starting point and subject, but differing in the assigned window size.

Optionally, any further interesting features of your program;
interesting problems you encountered in doing the project;
observations about the projects; etc.

You should choose your experiments so that a simple breadth-first
search will do badly, but a well-designed crawler could do well.
For instance, an experiment which started from a hub page with fifty
links to pages all of which were relevant to the subject would not be
a good experiment: too easy. An experiment that chose a subject
that is discussed in only one page on the Web would not be a good experiment:
too hard.

Some examples

I have here some examples of subjects and
links. Your experiments must include at least one example that is
not on this list, and that is not being done by other students.

I find that, on the whole, good subjects for this kind of experiment
tend to be subjects in which there is a lot of interest by amateurs.

Electronic resources

In general, you may use any suitable electronic resources that you find on
the Web. As mentioned above, these must be cited in your report. You
use any of these at your own risk; neither the TA nor I will help you
with problems you have with any of these, except the code that
I've provided myself.

If you should happen to find on the Web something that fits this assignment
exactly, let me know.

Crawlers

All kinds of crawler code are available on the Web, and you may use any of it.
As a starting point, I have written a minimal Web Crawler in Java.
You can also look at the code described in
Programming Spiders, Bots, and Aggregators in Java
by Jeff Heaton, chapter 8. (Note: This is accessible online for free through
an NYU account. You can also buy it in hard copy, with a CD-ROM.)
If you feel ambitious, you could try working with the SPHINX
package, which is a lot larger and more complex.

Natural Language tools

Your program may use natural language tools such as online thesauruses.
I have here a list of stop words,
which may be useful.

CRAWLER COURTESY: VERY IMPORTANT

Your crawler MUST always have a fixed upper bound on the total
number of files to be downloaded. In developing, testing, and debugging,
this number should be kept fairly SMALL.

So as not to overload the system, the system support staff asks that
you not do your work on servers such as slinky, spunky,
griffin, etc. Rather, you should run it on the public machines (e.g.
the pubsuns) or on the course machines (courses1 - courses6). If you
have an office, you can run it on the workstations in your office.

Robustness and Efficiency

You need not deal with the kind of robustness and efficiency issues
that we discussed in class:

You need only download HTML files and only via HTTP (not HTTPS).

You can assume that links in HTML files have the form

<a href="URL">

with no white space or weird characters in the middle of the URL.
You don't have to worry about the case where this appears in the
middle of some other HTML tag (such as a comment).
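
Under these assumptions, links can be pulled out with a simple regular expression. The following is a sketch rather than a full HTML parser. The class and method names are made up for the example. It resolves relative URLs against the URL of the page and keeps only HTTP links, using the standard java.net.URI and java.util.regex classes.

```java
import java.net.URI;
import java.util.*;
import java.util.regex.*;

class LinkExtractor {
    // Matches <a href="URL"> with no white space inside the URL; case-insensitive.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s+href=\"([^\"\\s]+)\"\\s*>", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute HTTP URLs linked from the page, in order of appearance. */
    static List<String> extractLinks(String pageUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(pageUrl);
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            try {
                URI resolved = base.resolve(m.group(1));
                if ("http".equalsIgnoreCase(resolved.getScheme())) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // The URL in the page is malformed: skip this link.
            }
        }
        return links;
    }
}
```

Anchors with extra attributes, single-quoted URLs, or links inside comments are not handled, in line with the simplifications above.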

You can assume that there are no errors in the input. Your code should
be reasonably robust to errors in the Web pages you're searching.
If an error is encountered, feel free, if necessary, just to skip the page
where it occurs.
Don't worry about memory constraints; if your program runs out of space and
dies on encountering a large file, that's OK. You do not have to use
multiple threads; sequential downloading is OK.
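
To make the last few points concrete, here is a minimal sketch of a sequential crawl loop that stops at a fixed number of downloads and skips any page that fails. The Fetcher and LinkFinder interfaces are hypothetical stand-ins for your own download and link-extraction code, and the loop uses a plain breadth-first queue rather than the scoring described earlier.

```java
import java.io.IOException;
import java.util.*;

class BoundedCrawl {
    /** Hypothetical stand-in for code that downloads a page over HTTP. */
    interface Fetcher { String fetch(String url) throws IOException; }

    /** Hypothetical stand-in for code that extracts links from a page. */
    interface LinkFinder { List<String> links(String url, String html); }

    /** Downloads at most maxPages pages, one at a time, starting from start. */
    static List<String> crawl(String start, int maxPages, Fetcher fetcher, LinkFinder finder) {
        List<String> downloaded = new ArrayList<>();
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(start);
        seen.add(start);
        // The bound is checked before every download, so it can never be exceeded.
        while (!frontier.isEmpty() && downloaded.size() < maxPages) {
            String url = frontier.poll();
            String html;
            try {
                html = fetcher.fetch(url);
            } catch (IOException | RuntimeException e) {
                continue; // Error on this page: skip it and move on.
            }
            downloaded.add(url);
            for (String link : finder.links(url, html)) {
                if (seen.add(link)) frontier.add(link);
            }
        }
        return downloaded;
    }
}
```

Failed attempts are not counted against the bound in this sketch. Since new links only come from successful downloads, the loop still terminates, but you may prefer to bound the number of attempts as well.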

Group and Variant Projects

I'm open to suggestions. You can do a group project, but it will have
to be proportionally bigger than the project described here. If you
have an idea for a crawler project that you think would be more worthwhile
or fun for you than this one, feel free to propose it.

Grading

A program that achieves the functionality specified here, but is
poorly coded, uncommented, with poorly chosen experiments, and
an inadequate report, will get 70/100. To get 100/100, the program
must achieve the functionality, be well coded, well commented, with
well-chosen experiments, and a good report.

Late Policy

"On time" means at class time on the due date.
Programs submitted late will get a penalty of 2 points out of 100 per day late,
up to a maximum of 20 points.