
Focused crawling in depression portal search: A feasibility study

Proceedings of the 9th Australasian Document Computing Symposium, Melbourne, Australia, December 13, 2004. Copyright for this article remains with the authors.

Abstract

[...] search services in the area of depressive illness has documented the significant human cost required to set up and maintain closed-crawl parameters. It also showed that domain coverage is much less than that of whole-of-web search engines. Here we report on the feasibility of techniques for achieving greater coverage at lower cost. We found that acceptably effective crawl parameters could be automatically derived from a DMOZ depression category list, with dramatic saving in effort. We also found evidence that focused crawling could be effective in this domain: relevant documents from diverse sources are extensively interlinked; many outgoing links from a constrained crawl based on DMOZ lead to additional relevant content; and we were able to achieve reasonable precision (88%) and recall (68%) using a J48-derived predictive classifier operating only on URL words, anchor text and text content adjacent to referring links. Future directions include implementing and evaluating a focused crawler. Furthermore, the quality of information in returned pages (measured in accordance with evidence-based medicine) is vital when searchers are consumers. Accordingly, automatic estimation of web site quality and its possible incorporation in a focused crawler is the subject of a separate concurrent study.

Keywords

focused crawler, hypertext classification, mental health, depression, domain-specific search.

Introduction

Depression is a major public health problem, being a leading cause of disease burden [13] and the leading risk factor for suicide. Recent research has demonstrated that high quality web-based depression information can improve public knowledge about depression and is associated with a reduction in depressive symptoms [6]. Thus, the Web is a potentially valuable resource for people with depression. However, a great deal of depression information on the Web is of poor quality when judged against the best available scientific evidence [8, 10]. It is thus important that consumers can locate depression information which is both relevant [...]

Recently, in [15], we compared examples of two types of search tool which can be used for locating depression information: whole-of-Web search engines such as Google, and domain-specific (portal) search services which include only selected sites. We found that coverage of depression information was much greater in Google than in portals devoted to depression.

BluePages Search (BPS) is a depression-specific search service offered as part of the BluePages depression information site. Its index was built by manually identifying and crawling areas on 207 Web servers containing depression information. It took about two weeks of intensive human effort to identify these areas (seed URLs) and define their extent by means of include and exclude patterns. Similar effort would be required at regular intervals to maintain coverage and accuracy. Despite this human effort, only about 17% of relevant pages returned by Google were contained in the BPS [...]

One might conclude from this that the best way to provide depression-portal search would be to add the word 'depression' to all queries and forward them to a general search engine such as Google. However, in other experiments in [15] relating to quality of information in search results, we showed that substantial amounts of the additional relevant information returned by Google was of low quality and not in accord with best available scientific evidence. The operators of the BluePages portal (ANU's Centre for Mental Health Research) were keen to know if it would be feasible to provide a portal search service featuring:

1. increased coverage of high-quality depression information;
2. reduced coverage of dubious, misleading or un[...]

3. significantly reduced human cost to maintain the [...]

We have attempted to answer the questions in two parts. Here we attempt to determine whether it is feasible to reduce human effort by using a directory of depression sites maintained by others as a seedlist and using focused crawling techniques to avoid the need to define include and exclude rules. We also investigate whether the content of a constrained crawl links to significant amounts of additional depression content and whether it is possible to tell which links lead to depression content.

A separate project is under way to determine whether it is feasible to evaluate the quality of depression sites using automatic means. [...] reported elsewhere. If the outcomes of both projects are favourable, the end-result may be a focused crawler capable of preferentially crawling relevant content from [...]

Focused crawling - related work

Focused crawlers, first described by de Bra et al. [2], for crawling a topic-focused set of Web pages, have been frequently studied [3, 1, 5, 9, 12].

A focused crawler seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively small portion of the Web. Focused crawlers require much smaller investment in hardware and network resources but may achieve high coverage at a [...]

A focused crawler starts with a seed list which contains URLs that are relevant to the topic of interest; it crawls these URLs and then follows the links from these pages to identify the most promising links based on both the content of the source pages and the link structure of the web [3]. Several studies have used simple string matching of these features to decide if the next link is worth following [1, 5, 9]. Others used reinforcement learning to build domain-specific search engines from similar features. For example, McCallum et al. [11] used Naive Bayes classifiers to classify hyperlinks based on both the full text of the sources and anchor text on the links pointing to the targets.

A focused crawler should be able to decide if a page is worth visiting before actually visiting it. This raises the general problem of hypertext classification.

In traditional text classification, the classifier looks only at the text in each document when deciding what [...]

Hypertext classification is different because it tries to classify documents without the need for the content of the document itself. Instead, it uses link information. Chakrabarti et al. [3] used the hypertext graph including in-neighbours (documents citing the target document) and out-neighbours (documents that the target document cites) as input to some classifiers.

Our work also used link information. We tried to predict the relevance of uncrawled URLs using three features: anchor text, text around the link and URL words.

Resources

This section describes the resources used in our experiments: the BluePages search service; the data from our previous domain-specific search experiments; the DMOZ depression directory listing and the WEKA machine learning [...]

BluePages Search

BluePages Search (BPS) is a search service offered as part of the existing BluePages depression information site. Crawling, indexing and search were performed by [...]

The list of web sites that made up the BPS was manually identified from the Yahoo! Directory and from querying general search engines using the query term 'depression'. Each URL from this list was then examined to find out if it was relevant to depression before it was selected. The fencing of web site boundaries was a much bigger issue. A lot of human effort was needed to examine all the links in each web site to decide which links should be included and excluded. Areas of 207 web sites were selected. These areas sometimes included a whole web server, sometimes a subtree of a web server and sometimes only some individual pages. Newspaper articles (which tend to be archived after a short time), potentially distressing, offensive or destructive materials and dead links were excluded during the [...]

A simple example of seeds and boundaries is:

• seed = www.counselingdepression.com/, and

• include patterns = www.counselingdepression.[...]

In this case, every link within this web site is included. In complicated cases, however, some areas should be included while others are excluded. For instance, examining www.drada.org would result in the following [...]

The above boundaries mean that everything within the web site should be crawled except for pages about bipolar [...]
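The seed-and-boundary scheme described for BPS can be sketched in a few lines. This is a minimal illustration, assuming that include and exclude patterns are plain URL prefixes and that exclusion takes precedence over inclusion; the source does not state the matching rule used by the BPS crawler. The `www.example.org` patterns and the function names are hypothetical, since the actual www.drada.org boundaries are not available here.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Return 'host/path' without the scheme, so URLs compare against scheme-less patterns."""
    parts = urlsplit(url if "://" in url else "//" + url)
    return parts.netloc.lower() + (parts.path or "/")


def in_scope(url, include, exclude=()):
    """True if the URL starts with an include prefix and with no exclude prefix (assumed rule)."""
    candidate = normalise(url)
    if any(candidate.startswith(prefix) for prefix in exclude):
        return False
    return any(candidate.startswith(prefix) for prefix in include)


# Simple case from the text: the seed site is included as a whole.
simple_include = ["www.counselingdepression.com/"]
print(in_scope("http://www.counselingdepression.com/articles/a.html", simple_include))

# Hypothetical complicated case: crawl a whole site except one subtree.
site_include = ["www.example.org/"]
site_exclude = ["www.example.org/bipolar/"]
print(in_scope("http://www.example.org/bipolar/intro.html", site_include, site_exclude))
```

Every boundary of this kind has to be written and maintained by hand for each site, which is the cost the rest of the study tries to avoid.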
Data from our previous work

In our previous work, we conducted a standard information [...] 'depression' queries against six engines of different types: two health portals, two depression-specific search engines, one general search engine and one general search engine where the word 'depression' was added to each query if not already present (GoogleD). We then pooled the results for each query and employed research assistants to judge them. We obtained 2778 judged URLs and 1575 relevant URLs from all the engines. We used these URLs as a base in the present work.

We found that, over 101 queries, GoogleD returned more relevant results than those of the domain-specific [...] while 683 relevant results were retrieved by GoogleD. As GoogleD was the best performer in obtaining the most relevant results, we also used it as a base engine to compare with other collections in the present work.

DMOZ is the Open Directory Project which is “the largest, most comprehensive human-edited directory of the Web. It is constructed and maintained by a vast, global community of volunteer editors”.

We started with the Depression directory which [...]

http://www.dmoz.org/Health/Mental_Health/

Weka was developed at the University of Waikato in New Zealand [16]. It is a data mining package which contains machine learning algorithms. Weka provides tools for data pre-processing, classification, regression, clustering, association rules, and visualization. Weka was used in our experiments for the prediction of URL relevance using hypertext features. It was used because it provided many classifiers, was easy to use and served our purposes well in predicting URL relevance.

Experiment 1 - Usefulness of a DMOZ category as a seed list

A focused crawler needs a good seed list of relevant URLs as a starting point for the crawl. These URLs should span a variety of web site types so that the crawler can explore the Web in many different directions. Instead of using a manually created list, we attempted to derive a seed list from a publicly available directory - DMOZ. Because depression sites on the web are widely scattered, the diversity of content in DMOZ is expected to improve coverage. Using DMOZ allows us to leverage off the categorisation work being done by volunteer editors.

DMOZ seed generation

We started from the 'depression' directory on the [...] Depression/. This directory is intended to contain links to relevant sites and subsites about depression. The directory, however, also had a small list of 12 within-site links to other directories, which may or [...] only needed to do some minor boundary selection for these links to include relevant directories. For example, the following directories were included because they are related to depression and they are [...] Medications/Antidepressants/. These links were selected simply because their URLs contain the term 'depression' (such as childhood_depression) or 'antidepressants'. The seed URLs, as a result, included the above links and all the links to depression-related sites and subsites from this directory.

Include patterns corresponding to the seed URLs were generated automatically. In general, the include pattern was the same as the URL, except that default page suffixes such as index.htm were removed. Thus, if the URL referenced the default page of a server or web directory, the whole server or whole directory was included. If the link was to an individual page, only that page was included.

The manual effort required to identify the seed URLs and define their extent varied greatly between BPS and DMOZ. While it took about two weeks of intensive effort in the BPS case, it only required about [...]

Comparison of the DMOZ collection and the BPS collection

This experiment aimed to find out if a constrained crawl from the low-cost DMOZ seed list can lead to domain coverage comparable to that of the manually configured [...]

After identifying the DMOZ seed list and include patterns as described above, we used the Panoptic crawler to build our DMOZ collection. We then ran the 101 queries from our previous study and obtained 779 [...]

We attempted to judge the relevance of these results using the 1575 known relevant URLs (see Section 3.2) and to compare the DMOZ results with those of the [...]

Table 1 shows that 186 out of 227 judged URLs (a pleasing 81%) from the DMOZ collection were relevant. However, the percentage of judged results (30%) was too low to allow us to validly conclude that DMOZ was a good collection.

Table 1: Comparison of relevant URLs in DMOZ and BPS results of running 101 queries.
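The automatic include-pattern rule described under DMOZ seed generation lends itself to a short sketch. The version below assumes a pattern is the seed URL without its scheme, with a trailing default page name removed. Only index.htm is named in the text; the other entries in `DEFAULT_PAGES`, the function name and the example URLs are illustrative assumptions.

```python
# Default page names; only "index.htm" comes from the text, the rest are assumed.
DEFAULT_PAGES = ("index.htm", "index.html", "default.htm", "default.html")


def include_pattern(seed_url, default_pages=DEFAULT_PAGES):
    """Derive an include pattern from a seed URL by dropping a default page suffix."""
    pattern = seed_url.split("://", 1)[-1]      # drop the scheme if present
    head, _, last = pattern.rpartition("/")
    if head and last.lower() in default_pages:
        return head + "/"                       # default page: include the whole directory
    return pattern                              # otherwise keep the URL as it is


# A directory's default page widens to the whole directory.
print(include_pattern("http://www.example.org/depression/index.htm"))
# An individual page stays an individual page.
print(include_pattern("http://www.example.org/articles/sad.html"))
```

Because the rule needs no per-site inspection, generating patterns for a whole directory listing costs almost nothing compared with hand-drawn boundaries.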
Since we no longer had access to the services of the judges from the original study we attempted to confirm that a reasonable proportion of the unjudged documents were relevant to the general topic of depression by sampling URLs and judging them ourselves.

We randomly selected 2 lists of 50 non-overlapped URLs among the unjudged results and made relevance judgments on these. In the first list, we obtained 35 relevant results and in the second list, 34 URLs were relevant. Because there was close agreement between the proportion relevant in each list we were confident that we could extrapolate the results to give a reasonable estimate of the total number of relevant pages returned.

Extrapolation suggests 381 relevant URLs for the [...] able to obtain 567 (186 + 381) relevant URLs from the DMOZ set. This number was not as high as that of BPS, but it was relatively high (72% relevant URLs in DMOZ set compared to 91% of these in BPS). Therefore, we could conclude that the DMOZ list is an acceptably good, low-maintenance starting point for a [...]

Experiments 2A-2C - Additional link-accessible relevant information

Although some focused crawlers can look a few links ahead to predict relevant links at some distance from the currently crawled URLs [7], the immediate outgoing links are of most immediate interest.

We performed three experiments to gauge how much additional relevant information is accessible one link away from the existing crawled content. Unless additional relevant content is linked to from pages in the original crawl, the prospects of successful focused crawling are very low. Figure 1 shows an illustration of the one-link-away set of URLs from the DMOZ crawl.

Figure 1: Illustration of one link away collection from [...]

The first experiment (2A) involved testing if outgoing links from the BPS collection were relevant while the second (2B) compared the outgoing link sets of BPS and DMOZ to see if DMOZ was really a good place to lead a focused crawler to additional relevant content. The last experiment (2C) attempted to find out if URLs relevant to a particular topic linked to each other.

Experiment 2A: Outgoing links from the BPS collection

The data used for this experiment included:

• the BPS outgoing link set containing all URLs [...]

• 2 sets of judged-relevant URLs: BPS relevant and [...]

Our previous work concluded that BPS didn't retrieve as many relevant documents as GoogleD because of its small coverage of sites. We wanted to find out if focused crawling techniques have the potential to raise BPS performance by crawling one step away from BPS. Among 954 relevant pages retrieved by all engines except for BPS, BPS failed to index 775 pages. The extended crawl yielded 196 of these 775 pages or 25.3%. In other words, an unrestricted crawler starting from the original BPS crawl would be able to reach an additional 25.3% of the known relevant pages, in only a single step from the existing pages. In fact, the true number of additional relevant pages is likely to be higher because of the large number of unjudged pages.

It is unclear whether the additional relevant content in the extended BPS crawl would enable more relevant documents to be retrieved than in the case of GoogleD. Retrieval performance depends upon the effectiveness of the ranking algorithm as well as on coverage.

Experiment 2B: Comparison of outgoing links between BPS and DMOZ

This experiment compared the out-going link sets of BPS and DMOZ to find out if the DMOZ seed list could be used instead of the BPS seed list to guide a focused crawler to relevant areas of the web. The following data were used:

• 2 sets of out-going links from the BPS and DMOZ [...]

• 2 sets of all judged URLs and judged-relevant [...]
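The comparisons in Experiments 2A and 2B come down to set operations over URL collections. The sketch below is one possible way to express them, assuming each collection is held as a Python set of URL strings and that every judged-relevant URL also appears in the judged set. The function names and the toy URLs are hypothetical; only the 196 and 775 figures in the last line are taken from the text.

```python
def summarise_outgoing(outgoing, judged, relevant):
    """Count how many outgoing links were judged, and how many of those were relevant."""
    judged_links = outgoing & judged
    relevant_links = outgoing & relevant
    proportion = len(relevant_links) / len(judged_links) if judged_links else 0.0
    return {
        "outgoing": len(outgoing),
        "judged": len(judged_links),
        "relevant": len(relevant_links),
        "proportion_relevant": proportion,
    }


def newly_reachable(outgoing, relevant, indexed):
    """Relevant pages missing from the index that lie one link away, and the number missing."""
    missed = relevant - indexed
    return len(outgoing & missed), len(missed)


# Toy data (hypothetical URLs) to show the shape of the computation.
outgoing = {"a", "b", "c", "d", "e"}
judged = {"a", "b", "c", "x"}
relevant = {"a", "b", "x"}
indexed = {"x"}
print(summarise_outgoing(outgoing, judged, relevant))
print(newly_reachable(outgoing, relevant, indexed))

# Figure reported for Experiment 2A: 196 of the 775 missed relevant pages.
print(round(100 * 196 / 775, 1))
```

Because only judged URLs can be counted as relevant, both numbers are lower bounds on what a one-step crawl actually reaches.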
Table 2: Comparison of relevant out-going link URLs [...]

From our previous work, we obtained 2778 judged URLs which were used here as a base to compare relevance. Table 2 shows that even though the outgoing link collection of DMOZ was more than double the size of that of BPS, more outgoing BPS pages were judged. Among the judged pages, BPS and DMOZ had 196 and 158 relevant pages respectively in their outgoing link sets. Although DMOZ had fewer known relevant pages than BPS, the proportion of relevant pages versus judged pages was quite similar for both engines (78% for DMOZ and 79% for BPS). This result together with the size of each outgoing link collection implied that (1) the DMOZ outgoing link set contained quite a large number of relevant URLs which could potentially be accessed by a focused crawler, and (2) the DMOZ seed list could lead to much better coverage than the BPS [...]

Experiment 2C: Linking patterns between relevant pages

We performed a very similar experiment to the experiment described in Section 5.1, with the purpose of finding out if relevant URLs on the same topic are linked to each other. Instead of using the whole BPS collection of 12,177 documents as the seed list, we only chose the 621 known relevant URLs. The following data were used:

• the BPS outgoing link set from the above, containing all URLs linked to by BPS known relevant [...]

• judged-relevant URLs from our previous work.

The outgoing link collection of the BPS known relevant URLs contained 5623 URLs. Of these, 158 were known relevant. This was a very high number compared to the 196 known relevant URLs obtained from the much bigger set of all outgoing link URLs (containing above 40,000 URLs) in the previous experiment. It is likely from this experiment that relevant pages tend to link to each other. This is good evidence supporting the feasibility of the focused crawling approach.

Experiment 3 - Hypertext classification

After downloading the content of the seed URLs and extracting links from them, a focused crawler needs to decide what links to follow and in what order based on the information it has available. We used hypertext [...]

Collection of URLs for training and testing

For both BPS and DMOZ crawls, we collected all immediate outgoing URLs satisfying the following two conditions (1) known relevant or known irrelevant URLs and (2) the URLs pointing to each of these URLs were also relevant. We collected 295 relevant and 251 irrelevant URLs for our classification experiment.

Features

Several papers in the field used the content of crawled URLs, anchor text, URL structure and other link graph information to predict the relevance of the next unvisited URLs [1, 5, 9]. Instead of looking at the content of the whole document pointing to the target URL, Chakrabarti [4] used 50 characters before and after a link and suggested that this method was more effective. Our work was somewhat related to all of the above. We used the following features to predict the relevance of [...]

• anchor text on the source pages: all the text appearing on the links to the target page from the [...]

• text around the link: 50 characters before and 50 characters after the link to the target page from the [...]

• URL words: words appearing in the URL of the [...]

We accumulated all words for each of these features to form 3 vocabularies where all stop words were eliminated. URL words separated by a comma, a full stop, a special character and a slash were parsed and treated as individual words. URL extensions such as .html, .asp, .htm, .php were also eliminated. The end result showed 1,774 distinct words in the anchor text vocabulary, 874 distinct words in the URL vocabulary, and 1103 distinct words in the content vocabulary.

For purposes of illustration, Table 3 shows the features extracted from each of six links to the same URL. Assume that we would like to predict www.ndmda.org for its relevance to depression and that we have six already-crawled pages pointing to it from our crawled collection. From each of the pages, features are extracted in the form of anchor text words and the words within a range of a maximum of 50 characters before and after the link pointing to www.ndmda.org. [...] the target URL because that URL contains only stop words and/or numbers which have been stripped off. The URL words for the target URL after being parsed [...]

7 We first extracted the 50-character string and then eliminated markup and stopwords, sometimes leaving only a few words.
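The three feature types described above (anchor text, text around the link, URL words) can be extracted with a small amount of string handling. The following is a rough sketch, not the extraction code used in the experiments: the regular expressions, the tiny `STOP_WORDS` set, the function names and the sample HTML are all assumptions made for illustration. It follows the order given in the footnote, taking the 50-character window first and stripping markup and stop words afterwards.

```python
import re

# Tiny illustrative stop list (assumed); a real list would be much longer.
STOP_WORDS = {"www", "http", "https", "com", "org", "the", "of", "and", "for", "a", "to", "in"}
EXTENSIONS = (".html", ".asp", ".htm", ".php")
TAG = re.compile(r"<[^>]+>")
LINK = re.compile(r'<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>', re.I | re.S)


def words(text):
    """Lower-case, split on anything that is not a letter, drop stop words and numbers."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t and t not in STOP_WORDS]


def url_words(url):
    """Words in a URL, after removing a known file extension."""
    lowered = url.lower()
    for ext in EXTENSIONS:
        if lowered.endswith(ext):
            lowered = lowered[: -len(ext)]
            break
    return words(lowered)


def link_features(html, target, window=50):
    """Anchor-text words and words within `window` characters of each link to `target`."""
    anchor, around = [], []
    for match in LINK.finditer(html):
        if target not in match.group(1):
            continue
        anchor += words(TAG.sub(" ", match.group(2)))
        before = html[max(0, match.start() - window): match.start()]
        after = html[match.end(): match.end() + window]
        # Window first, then markup removal; a tag cut by the window may leave debris.
        around += words(TAG.sub(" ", before + " " + after))
    return anchor, around


page = ('Support for depression and bipolar disorder: '
        '<a href="http://www.ndmda.org/">Depression and Bipolar Support Alliance</a>'
        ' offers peer groups.')
print(link_features(page, "www.ndmda.org"))
print(url_words("www.example.org/mood/childhood_depression.html"))
```

Features gathered from several source pages pointing at the same target would simply be concatenated before weighting.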
Table 3: Features for www.ndmda.org after removing stop words and numbers.

Target URL: www.ndmda.org
source URL    anchor text    content around the link
[...]    depression, bipolar, support, alliance, american, psychiatric
[...]    depression, bipolar, support, alliance, american, psychiatric

Classifier              Description
ZeroR                   Zero rule. Predicts the majority class. Used as a baseline.
Naive Bayes             Statistical method. Assumes independence of attributes. Uses conditional probability and Bayes rule.
Complement Naive Bayes  Class for building and using a Complement class Naive Bayes classifier.
J48                     C4.5 algorithm. A decision tree learner with pruning.
Bagging                 Class for bagging a classifier to reduce variance.
AdaBoostM1              Class for boosting a nominal class classifier using the Adaboost M1 method.

Classifiers

We compared a range of classification algorithms provided [...]

When training and testing the collection, we used a stratified cross-validation method, i.e. using 10-fold cross validation where one tenth of the collection was used for testing and the rest was used for training and the operation was repeated 10 times. The results were then averaged and a confusion matrix was drawn to find [...]

Input data

We treated the three vocabularies containing all features [...] frequency and inverse document frequency (tf.idf) for each feature attached to each of the URLs specified in Section 6.1 using the following formula [14].

[...]

where t is a term, d is a document, tf(t, d) is the frequency of t in d, n is total number of documents and df(t) is the number of documents containing t.

By this means we obtained a list of URLs, each associated with the tf.idfs for all terms in the 3 vocabularies. A learning algorithm was then run in Weka to learn and predict if these URLs were relevant or irrelevant. We also used boosting and bagging algorithms to boost the performance of different classifiers.

Measures

We used three measures to analyse how a classifier performed in categorizing all the URLs. We denoted true positive and true negative for the relevant and irrelevant URLs that were correctly predicted by the classifier respectively. Similarly, false positive and false negative were used for irrelevant and relevant URLs that were incorrectly predicted respectively. The three measures are:

• Accuracy: shows how accurately URLs are classified [...]

• Precision: shows the proportion of correctly relevant URLs out of all the URLs that were predicted as relevant.

• Recall: shows the proportion of relevant URLs that were correctly predicted out of all the relevant URLs.

Although accuracy is an important measure, a focused crawler would be more interested in following the links from the predicted relevant set to crawl other potentially relevant pages. Thus, precision and recall are better [...]

Results and discussion

The results of some representative classifiers are shown in Table 5. ZeroR represented a realistic performance “floor” as it classified all URLs into the largest category, i.e. relevant. As expected, it was the least accurate. Naive Bayes and J48 performed best. Naive Bayes was slightly better than J48 on recall but the latter was much better in obtaining higher accuracy and precision. Out of 228 URLs that J48 predicted as relevant, 201 were correct (88.15%). However, out of the 264 URLs predicted as relevant by Naive Bayes, only 206 (78.03%) were correct. Overall, the J48 algorithm was the best performer among all the classifiers used.

We found that bagging did not improve the classification result while boosting showed some improvement for recall (from 64.74% to 68.13%) when the J48 algorithm was used.

We also performed other experiments where only one set of features or any combination of two sets of features were used. In all cases, we observed that the accuracy, precision and recall were all worse than when all three sets of features were combined.

Our best results, as detailed in Table 5, showed that a focused crawler starting from a set of relevant URLs, and using J48 in predicting future URLs, could obtain a precision of 88% and a recall of 68% using the features [...]

We wished to compare these performance levels with the state of the art, but were unable to find in the literature any applicable results relating to the topic of depression. We therefore decided to compare our predictive classifier with a more conventional content classifier.

We built a 'content classifier' for 'depression', using only the content of the target documents instead of the features being used in our experiment. The best accuracies obtained from the two classification systems were very similar, 78% for the content classifier and 77.8% for the predictive version. Content classification showed slightly worse precision but better recall.

We concluded from this comparison that hypertext classification is quite effective in predicting the relevance of uncrawled URLs. This is quite pleasing as a lot of unnecessary crawling can be avoided.

Finally we explored two variant methods for feature selection. We found that generating features using stemmed words caused a reduction in performance, as did reducing the feature set using a feature selection [...]

Conclusions and future work

Weeks of human effort were required to set up the current BPS depression portal search service and considerable ongoing effort is needed to maintain its coverage and accuracy. Our investigations of the viability of a focused crawling alternative have resulted in three key findings.

First, web pages on the topic of depression are strongly interlinked despite the heterogeneity of [...] literature for other topic domains and provides a good foundation for focused crawling in the depression domain. The one-link away extensions to the closed BPS and DMOZ crawls contained many relevant pages.

Second, although somewhat inferior to the expensively constructed BPS alternative, the DMOZ depression category features a diversity of sources and seems to provide a seed list of adequate quality for a focused crawl in the depression domain. This is very good news for the maintainability of the portal search because of the very considerable labour savings. Other DMOZ categories may provide good starting points for other [...]

Third, predictive classification of outgoing links into relevant and irrelevant categories using source-page features such as anchor text, content around the link and URL words of the target pages, achieved [...] algorithm, as implemented by Weka, we obtained high accuracy, high precision and relatively high recall.

Given the promise of the approach, there is obvious follow-up work to be done on designing and building a domain-specific search portal using focused crawling techniques. In particular, it may be beneficial to rank the URLs classified as relevant in the order of degree of relevance so that a focused crawler can decide on visiting priorities. Also, appropriate data structures are needed to hold accumulated information for unvisited URLs (i.e. anchor text and nearby content for each referring link). This information needs to be updated as additional links to the same target are encountered. Another important question will be how to persuade Weka to output a classifier that can be easily plugged into the focused crawler's architecture. Since the best performing classifier in these trials was a decision tree, [...]

Once a focused crawler is constructed, it will be necessary to determine how to use it operationally. We envisage operating without any include or exclude rules but will need to decide on appropriate stopping conditions. If none of the outgoing links are classified as likely to lead to relevant content, should the crawl stop, or should some unpromising links be followed? [...]
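The follow-up work outlined in the conclusions asks for a data structure that accumulates evidence for unvisited URLs and orders them by degree of relevance. One possible shape for such a crawl frontier is sketched below. It is an illustration only: the `Frontier` class, its method names and the `toy_score` function are hypothetical, and the scoring callable stands in for a trained classifier such as the J48 tree.

```python
import heapq


class Frontier:
    """Priority queue of unvisited URLs, re-scored as new referring links arrive."""

    def __init__(self, score):
        self._score = score      # callable: list of (anchor, nearby) pairs -> number
        self._evidence = {}      # url -> accumulated (anchor text, nearby text) pairs
        self._heap = []          # (-score, url); may hold outdated entries
        self._visited = set()

    def add_link(self, url, anchor_text, nearby_text):
        """Record one more referring link and push the URL with its updated score."""
        if url in self._visited:
            return
        self._evidence.setdefault(url, []).append((anchor_text, nearby_text))
        heapq.heappush(self._heap, (-self._score(self._evidence[url]), url))

    def pop(self):
        """Return (url, score) for the most promising unvisited URL, or None when empty."""
        while self._heap:
            negative, url = heapq.heappop(self._heap)
            if url in self._visited:
                continue                         # already crawled
            current = self._score(self._evidence[url])
            if -negative != current:
                continue                         # outdated entry; a fresher one exists
            self._visited.add(url)
            del self._evidence[url]
            return url, current
        return None


def toy_score(evidence):
    """Stand-in for a classifier: count occurrences of 'depression' in the evidence."""
    text = " ".join(anchor + " " + nearby for anchor, nearby in evidence).lower()
    return text.count("depression")


frontier = Frontier(toy_score)
frontier.add_link("a.example/1", "depression help", "")
frontier.add_link("b.example/2", "sports news", "")
frontier.add_link("b.example/2", "depression depression", "coping with depression")
print(frontier.pop())
print(frontier.pop())
```

Returning the score alongside the URL leaves the stopping decision to the caller, for example stopping once the best remaining score falls below a chosen threshold.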
Classifier    Accuracy (%)    Precision (%)    Recall (%)

Because of the requirements of the depression portal operators, site quality must be taken into account in building the portal search service. Ideally, the focused crawler should take site quality into account when deciding whether to follow an outgoing link, but this may or may not be feasible. Another more expensive alternative would be to crawl using relevance as the sole criterion and to filter the results based on quality.

Site quality estimation is the subject of a separate study, yet to be completed. In the meantime, it seems fairly clear from our experiments that it will be possible to increase coverage of the depression domain for dramatically lower cost by starting from a DMOZ category [...]

Verifying whether techniques found useful in this project also extend to other domains is an obvious future step. Other health-related areas are the most likely candidates because of the focus on quality of information in those areas.

Acknowledgments

We gratefully acknowledge the contribution of Kathy Griffiths and Helen Christensen in providing expert input about the depression domain and about BluePages, and of John Lloyd and Eric McCreath for their advice [...]

References

[1] C. C. Aggarwal, F. Al-Garawi and P. S. Yu. On the design of a learning crawler for topical resource discovery. ACM Trans. Inf. Syst., Volume 19, Number 3, [...]

[2] P. De Bra, G. Houben, Y. Kornatzky and R. Post. Information retrieval in distributed hypertexts. In Proceedings of the 4th RIAO Conference, pages 481–491, [...]

[3] S. Chakrabarti, M. Berg and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceeding of the 8th International World Wide [...]

[4] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson and J. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proceedings of the seventh international conference on World Wide Web 7, pages 65–74. Elsevier [...]

[5] J. Cho, H. Garcia-Molina and L. Page. Efficient crawling through url ordering. In Proceeding of the Seventh World Wide Web Conference, 1998.

[6] H. Christensen, K. M. Griffiths and A. F. Jorm. Delivering Interventions for Depression by Using the Internet: Randomised Controlled Trial. British Medical Journal, Volume 328, Number 7434, pages 265–0, 2004.

[7] M. Diligenti, F. M. Coetzee, S. Lawrence, C. L. Giles and M. Gori. Focused crawling using context graphs. In Proceeding of the 26th VLDB Conference, Cairo, Egypt, [...]

[8] Berland G, Elliott M, Morales L, Algazy J, Kravitz R, Broder M, Kanouse D, Munoz J, Puyol J, Lara M, Watkins K, Yang H and McGlynn E. Health Information on the Internet: Accessibility, Quality, and Readability in English and Spanish. The Journal of the American Medical Association, Volume 285, Number 20, pages [...]

[9] M. Hersovici, M. Jacovi, Y. S. Maarek, D. Pelleg, M. Shtalhaim and S. Ur. The shark-search algorithm. An application: tailored web site mapping. In Proceeding of the Seventh World Wide Web Conference, 1998.

[10] Griffiths K and Christensen H. The quality and accessibility of australian depression sites on the world wide web. The Medical Journal of Australia, Volume 176, [...]

[11] A. McCallum, K. Nigam, J. Rennie and K. Seymore. Building domain-specific search engines with machine [...] Symposium on Intelligents Engine in Cyberspace, 1999.

[12] F. Menczer, G. Pant and P. Srinivasan. Evaluating topic-driven web crawlers. In Proceeding of the 24th Annual Intl. ACM SIGIR Conf. On Research and Development in Information Retrieval, 2001.

[13] C. J. L. Murray and A. D. Lopez (editors). The Global Burden of Disease and Injury Series. Harvard University Press, Cambridge MA, 1996.

[14] G. Salton and C. Buckley. Term weighting approaches in automatic text retrieval. Technical report, 1987.

[15] T.T. Tang, N. Craswell, D. Hawking, K. M. Griffiths and H. Christensen. Quality and relevance of domain-specific search: A case study in mental health. To appear in the Journal of Information Retrieval - Special [...]

[16] I. H. Witten and E. Frank. Data Mining: Practical machine learning tools with Java implementations. Morgan [...]