these are [hopefully] some of my last questions before deployment
[but not forever;]... keep your fingers crossed!
-------------------------------------------------------------------------
I am using 3.2.0b4 with the default htdig.conf, except for adding all 50 of my start URLs.
Searching shows the first instance of the word searched for, but does this
include meta tag keywords in the HTML document?
What are the alternatives when searching: document versus meta tag?
a) How do you restrict the search to meta tag keywords?
b) Can searches produce results from both the meta tag keywords and
the document?
c) Can searches be restricted to the document only?
I read this in the FAQ:
"4.14. How do I restrict a search to only meta keywords entries in documents?
First of all, you do not do this by using the "keywords" field in the
search form. This seems to be a frequent cause of confusion. The
"keywords" input parameter to htsearch has absolutely nothing to do
with searching meta keywords fields. It actually predates the
addition of meta keyword support in 3.1.x. A better choice of name
for the parameter would have been "requiredwords", because that's
what it really means - a list of words that are all required to be
found somewhere in the document, in addition to the words the user
specifies in the search form.
To restrict a search to meta keywords only, you must set all factors
other than keywords_factor to 0, and for 3.1.x, you must then reindex
your documents. In 3.2, you'll be able to change factors at search
time without needing to reindex, as well as offering the ability to
restrict the search in the query itself."
But I don't quite understand it.
Where is this "keywords_factor" (did I miss it?)? If the search is NOT
restricted to keywords_factor, does it index both the meta keywords and
the document?
When it says "in 3.2 you will be able to...", is that present in the
b4 snapshot now? If so, how does it work?
thank you VERY much.
-Ted
:(

Hmm... is there some limit on the file name? I don't think so; the files
do have long file names. I see in the -vvv report that even with the
"local" option, htdig tries to open the index.html file, which doesn't
exist, of course; that's why I use the mod_autoindex module in Apache to
build it. Next week I'll send the exact htdig error; I'm going home now.
Have a nice weekend
--- Gilles Detillieux <grdetil@...> wrote:
> According to Jose Julian Buda:
> > but when i ran htdig -vvv with this parameter, something was wrong,
> > because it said something like "local filesystem failed" and then
> > tried via http, but why doesn't it work locally? the files are
> > .html. so?
>
> To really be able to help, I'd need to know the exact error messages,
> what your local_urls setting looks like, what the URLs for your site
> look like, and where the documents reside on your server (i.e. the
> DocumentRoot in Apache).
>
> Otherwise, I can just guess that either local_urls is set incorrectly,
> or there are problems of some sort accessing the files.
>
> E.g.: on a Red Hat 6.2 system, the default DocumentRoot is
> /home/httpd/html, so the local_urls setting for a RH 6.2 server at,
> say http://www.mystuff.org, would be:
>
> local_urls: http://www.mystuff.org/=/home/httpd/html/
>
> Note that trailing slashes are important. Also, I'd have to make sure
> the files and directories are readable and searchable (i.e. the
> directories are executable) by the user ID under which I run htdig (as
> well as of course by the user ID under which the web server runs).
>
> Also, make sure your URLs use a consistent hostname. If your server
> is known by different names, you need to resolve them to a single
> name using server_aliases, otherwise the URLs may not match what you
> specify in local_urls.
>
> E.g., for my web server, URLs may have the host name www.scrc.umanitoba.ca
> or just scrc.umanitoba.ca (without the "www."), so I use
>
> server_aliases: scrc.umanitoba.ca:80=www.scrc.umanitoba.ca:80
>
> to add the missing "www." as needed.
>
> --
> Gilles R. Detillieux     E-mail: <grdetil@...>
> Spinal Cord Research Centre      WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada)    Fax: (204)789-3930
__________________________________________________
Do You Yahoo!?
Yahoo! GeoCities - quick and easy web site hosting, just $8.95/month.
http://geocities.yahoo.com/ps/info1

According to Jose Julian Buda:
> but when i ran htdig -vvv with this parameter ,
> something was wrong , because it said something "local
> filesystem failed" or something like that and then try
> via http , but why locally it dont work ?
> the files are .html .
> so?
To really be able to help, I'd need to know the exact error messages,
what your local_urls setting looks like, what the URLs for your site
look like, and where the documents reside on your server (i.e. the
DocumentRoot in Apache).
Otherwise, I can just guess that either local_urls is set incorrectly,
or there are problems of some sort accessing the files.
E.g.: on a Red Hat 6.2 system, the default DocumentRoot is
/home/httpd/html, so the local_urls setting for a RH 6.2 server at,
say http://www.mystuff.org, would be:
local_urls: http://www.mystuff.org/=/home/httpd/html/
Note that trailing slashes are important. Also, I'd have to make sure the
files and directories are readable and searchable (i.e. the directories
are executable) by the user ID under which I run htdig (as well as of
course by the user ID under which the web server runs).
Also, make sure your URLs use a consistent hostname. If your server
is known by different names, you need to resolve them to a single
name using server_aliases, otherwise the URLs may not match what you
specify in local_urls.
E.g., for my web server, URLs may have the host name www.scrc.umanitoba.ca
or just scrc.umanitoba.ca (without the "www."), so I use
server_aliases: scrc.umanitoba.ca:80=www.scrc.umanitoba.ca:80
to add the missing "www." as needed.
--
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

According to Sunny Fortune:
> > As it turns out, the first htmerge pass, after htdig, is needed on
> > each database before you run htmerge -m. The code that handles the
> > merging of two databases expects that the wordlist has already been
> > purged of control records that htdig uses to tell htmerge about
> > documents to update or delete.
>
> What does "already been purged of control records" mean?
Essentially it means you've already run htmerge after htdig. htdig puts
not only words in db.wordlist, but also some control records which tell
htmerge to clear out certain records from the database. htmerge only
expects and understands these records when run without the -m option.
If you run htmerge -m and the wordlist for the database you're merging
has some of these control records, their DocIDs don't get adjusted so
htmerge ends up clearing out the wrong records from the database.
> I am presently running digs on each of my sites and
> then finally performing a merge to one of the sites.
> Example,
> htdig -c site1.conf
> htdig -c site2.conf
> htdig -c site3.conf
>
> htmerge -c site1.conf
> htmerge -c site1.conf -m site2.conf
> htmerge -c site1.conf -m site3.conf
>
> So the search is performed on the merged database at
> site1.
>
> Isn't this also a correct method?
No. You must also run "htmerge -c site2.conf" and "htmerge -c site3.conf"
before merging sites 2 and 3 into 1. Otherwise, you run the risk of losing
some site1 records, possibly some valid site2 records, and maybe even some
valid site3 records as well.
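Putting that together, Sunny's sequence needs a per-database htmerge pass before any -m merge. A corrected sketch, using the same hypothetical site1/site2/site3 config names from the example above:

```
htdig -c site1.conf
htdig -c site2.conf
htdig -c site3.conf

# First purge the control records from each database's wordlist...
htmerge -c site1.conf
htmerge -c site2.conf
htmerge -c site3.conf

# ...then merge sites 2 and 3 into site 1's databases.
htmerge -c site1.conf -m site2.conf
htmerge -c site1.conf -m site3.conf
```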
--
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

According to Fogaras Daniel:
> Thanks for your mail. The same problem appears in db.worddump, the
> words belonging to document ID 231 are not contained in the page
> http://www.mit.edu/people/asandqui/home.html (which is assigned with 231
> in db.docs). So it seems that not only the excerpts but the document IDs
> are also mixed.
That's interesting. Do the words in db.worddump for DocID 231 match up
with the words in the db.docs H field for 231? If so, it would seem that
it's not the db.docdb, db.words.db and db.excerpts databases that are
messed up, but rather the db.docs.index file. This latter database is
the one that maps URLs to DocIDs, so if it gets messed up, there would
obviously be confusion about which URL corresponds to which database
records in the other 3 files.
> We use 3.2.0b3 and you suggested to install 3.2.0b4. However, there is not a
> 3.2.0b4 version, this is mentioned as a "snapshot" with a sentence
> "Do NOT use this code unless you like living on the bleeding edge." Do you
> think these snapshots are stable?
The warning above is essentially a standard disclaimer. Because the
snapshots are just that, a week by week picture of the state of the
code in the CVS tree, we can't always promise that we won't have just
committed some buggy modifications. This does happen from time to time,
and we don't always catch it by the end of the week.
However, at the current time, the 3.1.6 snapshot is much more solid
than even the 3.1.5 "stable" release, and the current state of things in
3.2.0b4, while still only beta quality, fixes TONS of known problems in
3.2.0b3, so right now the snapshots are a much safer bet for both the
stable and beta development lines.
--
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

But when I ran htdig -vvv with this parameter, something was wrong: it
said something like "local filesystem failed" and then tried via HTTP.
Why doesn't it work locally? The files are .html. So?
--- Gilles Detillieux <grdetil@...> wrote:
> According to Jose Julian Buda:
> > well, my search engine is on the same machine as the webserver,
> > so... i'll try out this "local_urls" parameter.
> > do i need to rebuild the whole database to use this parameter?
>
> All the parameter does is tell htdig how to get the files locally,
> but it doesn't (or shouldn't) change what htdig puts into the database
> for these documents. So no, you don't need to rebuild your database
> just because you changed this attribute, but it won't be used until you
> rebuild or update your database. If you want to test it right away,
> you can just run htdig -vv followed by htmerge (for 3.1.x) or htpurge
> (for 3.2.0bx) to see if it's finding the local files. It should report
> "not changed" for all the files.
>
> Do carefully read the documentation for local_urls, as this attribute
> only works with a very limited set of file name suffixes (or
> extensions), and will fall back to the HTTP server for anything else.
>
> --
> Gilles R. Detillieux     E-mail: <grdetil@...>
> Spinal Cord Research Centre      WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada)    Fax: (204)789-3930
=====
Jose Julian Buda
WebMaster - http://www.noticiasargentinas.com
Noticias Argentinas
"Soñe con un mundo , un mundo sin Windows..."

> As it turns out, the first htmerge pass, after htdig, is needed on each
> database before you run htmerge -m. The code that handles the merging
> of two databases expects that the wordlist has already been purged of
> control records that htdig uses to tell htmerge about documents to
> update or delete.

What does "already been purged of control records" mean?
I am presently running digs on each of my sites and
then finally performing a merge to one of the sites.
Example,
htdig -c site1.conf
htdig -c site2.conf
htdig -c site3.conf
htmerge -c site1.conf
htmerge -c site1.conf -m site2.conf
htmerge -c site1.conf -m site3.conf
So the search is performed on the merged database at
site1.
Isn't this also a correct method?
Thanks,
Sunny Fortune

> New server: burnie.mainsoft.com, 80
> Unable to build connection with burnie.mainsoft.com:80
You should probably give the internal address when making the request.
Your environment may not allow you to make external requests.
--Sandeep
~~~~~~~~~~~~~~~~~~~
Sandeep Hulsandra
Product Developer
InfiNet

Firstly, are you sure that the webserver is working? I can't seem to
connect to it myself (though it does respond to pings), but it may be
firewalled to the outside so I can't really tell...
i.e. can you look at http://burnie.mainsoft.com/ correctly?
--- Alex
Rekha Das wrote:
>
> I have installed htdig on my server. When I run htdig it gives the following
> error:
>
> ----------------------------------------------------------------------------
> ---------------------------------
> New server: burnie.mainsoft.com, 80
> Unable to build connection with burnie.mainsoft.com:80
> ----------------------------------------------------------------------------
> ---------------------------------
>
> I have apache-1.3.12 and raven ssl.
>
> Thanks,
> Rekha
>
> _______________________________________________
> htdig-general mailing list <htdig-general@...>
> To unsubscribe, send a message to <htdig-general-request@...> with a subject of unsubscribe
> FAQ: http://htdig.sourceforge.net/FAQ.html
--
-----------------------------------
Alexander Cohen
Web Developer
La Trobe University - ITS
A.Cohen@...
M: 0419-595-817
W: (03) 9479-3444
-----------------------------------

I have installed htdig on my server. When I run htdig it gives the following
error:
----------------------------------------------------------------------------
---------------------------------
New server: burnie.mainsoft.com, 80
Unable to build connection with burnie.mainsoft.com:80
----------------------------------------------------------------------------
---------------------------------
I have apache-1.3.12 and raven ssl.
Thanks,
Rekha

I'm resending this because I forgot to put the list address back on
the cc list. (I really don't like off-list replies. See FAQ 1.16 for
the reasons.)
According to Franck Collineau:
> I have installed the 3.1.6.
>
> I have set the input parameters in htdig.conf
No, you set the configuration attributes in htdig.conf. Input parameters
go in your search form. See http://www.htdig.org/FAQ.html#q4.18
> But the documentation talks about "search form "
>
> "This specifies the day of the cutoff start date for search results. If the
> start or end date are specified, only results with a last modified date
> within this range are shown. The startday can be specified from within the
> configuration file, and can be overridden with the "startday" input parameter
> in the search form."
It says you can override them in your search form. That means you
can set them in either the htdig.conf file, or your search.html (as
CGI input parameters), but if you use both the search form input
parameters take precedence. The same goes for a number of other
CGI input parameters which have defaults specified by attributes.
See http://www.htdig.org/hts_form.html
> I don't know what is this search form.
> I have tried search.html but it is not.
>
> Where is it ?
Yes, it is search.html. You can add input or select tags there for any of
the input parameters that htsearch allows, as described in the references
above. If you don't understand how to work with HTML forms, read up on
them in a tutorial on forms, such as...
http://MasterCGI.com/howtoinfo/formtutorial.shtml
--
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

On Thu, 29 Nov 2001 trogers@... wrote:
> Sorry to bother you -I realize the info is probably fairly close by-
> how about "max_hops", will that attribute operate in 3.2.0b4?
Please see the
documentation... e.g. <http://www.htdig.org/dev/htdig-3.2/cf_byname.html>
(or htdoc/*.html in your source distribution.)
You'll see, for example, that there is no such attribute as "max_hops", but
there is a max_hop_count, which has existed for a long time (i.e. since
before 3.0).
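For instance, to limit the dig to pages within two links of each start URL, you would set something like the following in htdig.conf (the value here is purely illustrative):

```
# htdig.conf: follow links at most 2 hops from the start URLs
max_hop_count: 2
```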
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

Sorry to bother you -I realize the info is probably fairly close by-
how about "max_hops", will that attribute operate in 3.2.0b4?
If not, what is the best way to downgrade my installation -and to
what version? (I used 3.2.0b4 because it said a security leak was
fixed in it... there is a sys admin somewhere who's probably watching
me ;).
Thanks again for your time.
-Ted
At 4:12 PM -0600 11/29/01, Gilles Detillieux wrote:
>According to trogers@...:
>> Btw, does the "max_excerpts" attribute work in 3.2.0b4?
>
>No. I wasn't going to add anything new to htsearch in 3.2 until
>Geoff finishes with the reorganisation of the code. I don't expect
>that will happen until 3.1.6 is finished and released, but it's
>Geoff's call. Still, I probably won't do much with 3.2 until I'm
>done with 3.1.6.
>
>--
>Gilles R. Detillieux E-mail: <grdetil@...>
>Spinal Cord Research Centre WWW:
>http://www.scrc.umanitoba.ca/~grdetil
>Dept. Physiology, U. of Manitoba Phone: (204)789-3766
>Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

According to trogers@...:
> Btw, does the "max_excerpts" attribute work in 3.2.0b4?
No. I wasn't going to add anything new to htsearch in 3.2 until
Geoff finishes with the reorganisation of the code. I don't expect
that will happen until 3.1.6 is finished and released, but it's
Geoff's call. Still, I probably won't do much with 3.2 until I'm
done with 3.1.6.
--
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

At 1:39 PM -0600 11/29/01, Gilles Detillieux wrote:
>According to trogers@...:
>> i'm using htdig 3.2.0b4.
>>
>> i meant to ask also if the other "programs" fuzzy, et al, would be
>> working considering how i installed (see below, please).
>> shall i move htfuzzy and htdig, et al, into cgi-bin with htsearch?
>
>No, you should only put htsearch in cgi-bin. The other programs can
>go wherever works for you, but usually somewhere in your PATH is the
>most convenient place for them. Some scripts may need to be customised
>if you move htdig, htmerge, htfuzzy, htnotify or others, if these scripts
>refer to the pathname of the directory where they were originally to be
>installed. The rundig script is one such script that you may need to
>customise.
>
>The tricky part is if you move your htsearch configuration file(s) to
>a different directory than you specified when you originally configured
>the software, because this directory name is compiled into the htsearch
>program, so if you move the directory htsearch won't find its config
>files.
Very good of you. Yes, I was careful with my installation; I'm confident
everything is installed (prefixes, etc.) so that I could move htdig and
install everything else in my home dir instead of the default install
locations.
>
>> and one more question: here is a set of test urls i put in my htdig conf:
>>
>> start_url: http://slis-two.lis.fsu.edu/~G634-23/LIS5364/
>> http://slis-two.lis.fsu.edu/~G634-1/ip1.htm
>> http://slis-two.lis.fsu.edu/~G634-1/ip2.htm
>> http://slis-two.lis.fsu.edu/~G634-1/ip3.htm
>> http://slis-two.lis.fsu.edu/~G634-1/tp1.htm
>>
>> (it may not come out right in email but each url is separated by 4
>> spaces, the last one has 3 spaces) when i added these and ran my test
>> search (http://slis-two.lis.fsu.edu/~G634-23/test.html) i had to go
>> back and run ./rundig again to get it to pickup the 2nd and 3rd urls
>> -it doesn't get the last two at all... when i searched for the word
>> "information".
>>
>> why do i have to run rundig again and again and it still doesn't
>> get all urls? i am soon going to put 50!!!!!
>
>There are two different ways of interpreting your question.
>
>1) You're expecting the database to automatically pick up any new
>URLs in start_url without having to run rundig again, or run htdig
>and htpurge.
>
>2) You are running rundig again after adding URLs to start_url, and
>the database is still not picking up the new URLs.
>
>If it's the first case, you don't understand how the system works.
>The htsearch program doesn't update the databases, it only reads them,
>so whenever you change a config attribute that affects what goes into
>the database, you need to rebuild or at least update the database, with
>the htdig program. The rundig script runs htdig with the -i option, to
>rebuild from scratch, and then runs htpurge to clean up unused entries.
>You can update the database instead of rebuilding from scratch, by
>running htdig (without -i) and htpurge separately. You will need to
>do this from time to time to make sure your database picks up any
>updates to the web sites as well. This is usually done via a shell
>script run from your crontab. (See "man crontab" on your system.)
>
>If the second point above is what you mean, then you need to find out why
>htdig isn't indexing everything. See http://www.htdig.org/FAQ.html#q4.1
I'm sorry to have caused you to type so much. I REALLY appreciate
what you do on this list.
I was referring to #2, and it turned out (my bad) the pages had meta
tag blocks on them. DOH! sorry.
>
>... and earlier...
>> i have tested to the following extent: i can search my own pages and
>> as far as i know any other public www pages, e.g., htdig.org and
>> others, including a students public site that resides on the same
>> server -i use the urls, and i will use the urls, in the form of
>> http://blah.blah.blah/ for every student url i add to the conf start
>> url "list".
>>
>> so... will it work?
>
>Well, if you can index one site and search it, then obviously it works.
>There's nothing in htdig to prevent it from also working on 50 or more
>sites, so it should work. The only way to know for sure is to try it, and
>run it in debugging mode (with -v options) if it doesn't work the way you
>think it should. Of course, it helps to have a correct idea of how you
>think it should work, and that's where reading the documentation comes in.
Great. I will hopefully report back in December how wonderfully this
worked for this OS X user, unix beginner.
Btw, does the "max_excerpts" attribute work in 3.2.0b4?
Thanks very much.
Ted Rogers

According to Jose Julian Buda:
> well , my search engine is on the same webserver's
> machine , so....i'll try out this parameter
> "local_urls"
> do i need rebuild all the database to use this
> parameter ?
All the parameter does is tell htdig how to get the files locally,
but it doesn't (or shouldn't) change what htdig puts into the database
for these documents. So no, you don't need to rebuild your database
just because you changed this attribute, but it won't be used until you
rebuild or update your database. If you want to test it right away,
you can just run htdig -vv followed by htmerge (for 3.1.x) or htpurge
(for 3.2.0bx) to see if it's finding the local files. It should report
"not changed" for all the files.
Do carefully read the documentation for local_urls, as this attribute
only works with a very limited set of file name suffixes (or extensions),
and will fall back to the HTTP server for anything else.
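For a 3.2.0bx install, that quick test might look like the following command sketch (the config file path is an assumption; substitute your own):

```
# Update dig in verbose mode; with local_urls working, unchanged
# files should be reported as "not changed" rather than re-fetched.
htdig -vv -c /usr/local/etc/htdig/htdig.conf
htpurge -c /usr/local/etc/htdig/htdig.conf    # use htmerge instead for 3.1.x
```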
--
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

According to trogers@...:
> i'm using htdig 3.2.0b4.
>
> i meant to ask also if the other "programs" fuzzy, et al, would be
> working considering how i installed (see below, please).
> shall i move htfuzzy and htdig, et al, into cgi-bin with htsearch?
No, you should only put htsearch in cgi-bin. The other programs can
go wherever works for you, but usually somewhere in your PATH is the
most convenient place for them. Some scripts may need to be customised
if you move htdig, htmerge, htfuzzy, htnotify or others, if these scripts
refer to the pathname of the directory where they were originally to be
installed. The rundig script is one such script that you may need to
customise.
The tricky part is if you move your htsearch configuration file(s) to
a different directory than you specified when you originally configured
the software, because this directory name is compiled into the htsearch
program, so if you move the directory htsearch won't find its config
files.
> and one more question: here is a set of test urls i put in my htdig conf:
>
> start_url: http://slis-two.lis.fsu.edu/~G634-23/LIS5364/
> http://slis-two.lis.fsu.edu/~G634-1/ip1.htm
> http://slis-two.lis.fsu.edu/~G634-1/ip2.htm
> http://slis-two.lis.fsu.edu/~G634-1/ip3.htm
> http://slis-two.lis.fsu.edu/~G634-1/tp1.htm
>
> (it may not come out right in email but each url is separated by 4
> spaces, the last one has 3 spaces) when i added these and ran my test
> search (http://slis-two.lis.fsu.edu/~G634-23/test.html) i had to go
> back and run ./rundig again to get it to pickup the 2nd and 3rd urls
> -it doesn't get the last two at all... when i searched for the word
> "information".
>
> why do i have to run rundig again and again and it still doesn't
> get all urls? i am soon going to put 50!!!!!
There are two different ways of interpreting your question.
1) You're expecting the database to automatically pick up any new
URLs in start_url without having to run rundig again, or run htdig
and htpurge.
2) You are running rundig again after adding URLs to start_url, and
the database is still not picking up the new URLs.
If it's the first case, you don't understand how the system works.
The htsearch program doesn't update the databases, it only reads them,
so whenever you change a config attribute that affects what goes into
the database, you need to rebuild or at least update the database, with
the htdig program. The rundig script runs htdig with the -i option, to
rebuild from scratch, and then runs htpurge to clean up unused entries.
You can update the database instead of rebuilding from scratch, by
running htdig (without -i) and htpurge separately. You will need to
do this from time to time to make sure your database picks up any
updates to the web sites as well. This is usually done via a shell
script run from your crontab. (See "man crontab" on your system.)
If the second point above is what you mean, then you need to find out why
htdig isn't indexing everything. See http://www.htdig.org/FAQ.html#q4.1
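As a sketch, a nightly update dig driven from cron could look like this (binary and config paths are assumptions; adjust to wherever your installation put them):

```
# crontab entry: update dig (htdig without -i) plus purge, nightly at 2:30
30 2 * * * /usr/local/bin/htdig -c /usr/local/etc/htdig/htdig.conf && /usr/local/bin/htpurge -c /usr/local/etc/htdig/htdig.conf
```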
... and earlier...
> i have tested to the following extent: i can search my own pages and
> as far as i know any other public www pages, e.g., htdig.org and
> others, including a students public site that resides on the same
> server -i use the urls, and i will use the urls, in the form of
> http://blah.blah.blah/ for every student url i add to the conf start
> url "list".
>
> so... will it work?
Well, if you can index one site and search it, then obviously it works.
There's nothing in htdig to prevent it from also working on 50 or more
sites, so it should work. The only way to know for sure is to try it, and
run it in debugging mode (with -v options) if it doesn't work the way you
think it should. Of course, it helps to have a correct idea of how you
think it should work, and that's where reading the documentation comes in.
--
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930

Well, my search engine is on the same machine as the webserver, so...
I'll try out this "local_urls" parameter.
Do I need to rebuild the whole database to use this parameter?
Thank you
--- Gilles Detillieux <grdetil@...> wrote:
> According to David T. Ashley:
> > What kind of site are you indexing?
> >
> > How is the communication between the machine with the search engine
> > and the machine hosting the site?
> >
> > Are the server and search engine on the same machine?
> >
> > What OS?
>
> Jose Julian Buda wrote:
> > Did someone index more than 45,000 files with htdig?
> > I have a Pentium III 933, 256 MB RAM, 30 GB IDE, and when I run
> > "rundig" to create the database, it takes almost 10 hours to make
> > the complete database from the beginning.
> > I incremented the "timeout" Apache parameter to... well... very
> > high, and this time it worked, but it took a long time.
> > Is that correct?
> > I hope that from now on, just running htdig and htmerge will do the
> > update and not take much time.
>
> And earlier...
> > Because I do have some problems with the digging process. I set the
> > "timeout" parameter to 10000 now, because I think that is the
> > problem: when I run "htdig -vvv" I see that the program locks up
> > waiting at
> >
> > "Retrieval command for http://myserver/mydirectory_to_index/ GET
> > /mydirectory_to_index/ HTTP/1.0..."
> >
> > and then it says:
> >
> > "Header Line:
> > returnStatus = 1
> > not found
> > pick: myserver, # servers = 1"
> >
> > but this directory on my webserver exists.
> >
> > So? Is it an Apache configuration problem?
>
> Yes, these messages are consistent with htdig timing out while waiting
> for an HTTP header from the server. What isn't clear to me from your
> messages is whether increasing the timeout to something large makes it
> work correctly, or whether it's still failing.
>
> If it's working correctly, I don't see what the problem is. Certainly,
> indexing 45,000 documents over HTTP is going to take quite a while, so
> 10 hours doesn't seem unreasonable. You may be able to avoid the hangs
> and timeouts by setting server_wait_time to something like 1 or 2, but
> then it may take longer still to index the site, because of the pause
> between documents fetched.
>
> Once you've got a complete database of all your documents, updating it
> with htdig (without the -i option) and htmerge should be quicker, as
> it can quickly check which documents are unchanged, and it won't fetch
> or reparse these.
>
> David's question about whether the web server and htdig are on the same
> machine is quite significant. If they are, you can take advantage of
> the local_urls feature to speed up indexing by bypassing the HTTP server
> and fetching files right from the local filesystem. This will be an
> added benefit for update digs too, because checking for updated documents
> will be very, very quick.
>
> --
> Gilles R. Detillieux     E-mail: <grdetil@...>
> Spinal Cord Research Centre      WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada)    Fax: (204)789-3930
=====
Jose Julian Buda
WebMaster - http://www.noticiasargentinas.com
Noticias Argentinas
"Soñe con un mundo , un mundo sin Windows..."

According to trogers@...:
> should htdig/htsearch results show all instances (or only one) of the
> searched for "word"?
> (e.g., my search results show one instance of the "word" and of
> course it gives the url, if i go to the url and use the browser to
> search the same "word" i can retrieve lots of them.)
Normally, htsearch only shows the first matching word in the excerpt
of the search results. In the 3.1.6 snapshot, there is a max_excerpts
attribute to set it to show more than 1.
See http://www.htdig.org/files/snapshots/
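With a 3.1.6 snapshot installed, that would be a one-line addition to htdig.conf (the value here is illustrative):

```
# htdig.conf: show up to 3 matching excerpts per search result
max_excerpts: 3
```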
--
Gilles R. Detillieux E-mail: <grdetil@...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930