I suggest moving this discussion to the htdig3-dev list. Anyone who
wants to follow it should feel free to do so.

At 5:20 PM -0500 2/23/00, J Kinsley wrote:
>NOTE: ht://Dig is running on the same physical host as the web server
>it is indexing, so network bandwidth is not a factor here.

First off, I'd suggest using local_urls, which you don't mention.
That would certainly give you a rather significant speed boost, but I
digress.
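
For reference, a minimal sketch of what that might look like in
htdig.conf. The hostname and filesystem path are placeholders I made
up, not your actual setup; the attribute maps a URL prefix to the
local directory that serves it, so htdig can read those files from
disk instead of fetching them over HTTP.

```
# htdig.conf fragment (hostname and path below are placeholders)
local_urls: http://www.example.com/=/usr/local/apache/htdocs/
```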

>Using my estimated end time above, we're looking at a 27 hour
>increase in index time on ~50,000 URL's. I do not think this is what
>you mean by 'a few trade-offs', so I am guessing it is a bug. Although I
>do not fully understand how to detect memory leaks, I suspect that is
>the problem. When I first start htdig, it indexes the first 1000
>URL's in about 6 minutes and the RSS creeps up to around 18-19MB and
>it starts to slow down.

It isn't a bug. I also doubt it's an actual leak--we've run the
source through Purify a few times. There may still be leaks, since we
clearly haven't hit all the code in testing, but I don't think that's
the cause here. (My favorite quick-and-dirty memory debugger is
LeakTracer; you can find it on Freshmeat.net.)

The explanation is going to be long, so don't say I didn't warn
you. I'd guess the biggest performance hit in indexing right now
comes from a trade-off. By "trade-off" I mean something specific:
think of a compression algorithm that takes much longer to compress
than to decompress. We've made indexing slower so that searching is
faster.

Previous versions stored the document DB keyed by URL. This was great
for indexing: you'd just check whether a given URL existed and could
retrieve it easily. The snag comes with the word database, which
stored words by DocID. So when doing a search, htsearch had to look
up the URLs in a DocID->URL index (db.docs.index). This is really
silly--we'd much rather have htdig do the work and have htsearch fly.
So 3.2 keys the document DB by DocID as well, and htsearch doesn't do
any additional lookups.
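
To make the search side concrete, here is a rough sketch in Python.
It is not the real code (ht://Dig is C++ on top of on-disk
databases); the dictionaries, function names, and sample records are
all made up for illustration. It only shows that the old layout needs
an extra hop per hit while the DocID-keyed layout does not.

```python
# Toy stand-ins for the on-disk databases (all names are hypothetical).
DOCS = [
    {"url": "http://example.com/a", "title": "A"},
    {"url": "http://example.com/b", "title": "B"},
]

# Word database: word -> list of DocIDs (same in both layouts).
word_db = {"apple": [1, 2], "pear": [2]}

# Old layout: document DB keyed by URL, plus a DocID -> URL index.
old_doc_db = {doc["url"]: doc for doc in DOCS}
old_doc_index = {i + 1: doc["url"] for i, doc in enumerate(DOCS)}

# 3.2 layout: document DB keyed directly by DocID.
new_doc_db = {i + 1: doc for i, doc in enumerate(DOCS)}


def search_old(word):
    """Two lookups per hit: DocID -> URL, then URL -> document."""
    hits = []
    for doc_id in word_db.get(word, []):
        url = old_doc_index[doc_id]
        hits.append(old_doc_db[url])
    return hits


def search_new(word):
    """One lookup per hit: DocID -> document."""
    return [new_doc_db[doc_id] for doc_id in word_db.get(word, [])]
```

Both functions return the same documents; the second just skips the
index hop on every hit.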

But now htdig is stuck--as it stands, it's looking up the DocID in a
URL->DocID database every time it wants to retrieve a document or
check to see if the document is in the database. This is bad.
(Remember, the lookup is going to slow down as we get more documents.)
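
The indexing side of the same trade-off, again as a hedged Python
sketch rather than the actual htdig code. The class and its counter
are hypothetical; the point is only that with documents keyed by
DocID, every "have we seen this URL?" check goes through a
URL->DocID map first.

```python
class DocStore:
    """Toy model of a DocID-keyed document DB (names are made up)."""

    def __init__(self):
        self.url_to_id = {}   # URL -> DocID map the indexer must consult
        self.docs = {}        # DocID -> document record
        self.lookups = 0      # how many database lookups we performed

    def get(self, url):
        """Return the document for a URL, or None if it is not stored."""
        self.lookups += 1                 # URL -> DocID
        doc_id = self.url_to_id.get(url)
        if doc_id is None:
            return None
        self.lookups += 1                 # DocID -> document
        return self.docs[doc_id]

    def add(self, url):
        """Store a URL once; the existence check costs lookups too."""
        if self.get(url) is None:
            doc_id = len(self.docs) + 1
            self.url_to_id[url] = doc_id
            self.docs[doc_id] = {"url": url}


store = DocStore()
for url in ["http://example.com/a", "http://example.com/b",
            "http://example.com/a"]:
    store.add(url)
```

With real on-disk databases each of those counted lookups is a
database access, not a cheap in-memory dict probe.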

There's also the matter discussed a while ago about the "Need2Get"
list of URLs. It's big since it's a full hash table of all the URLs
we've visited. Since it's a hash, it has a lot of free space. So if
you don't have the memory for that, you're going to start paging.
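
As a back-of-the-envelope illustration of the free space, here is a
small Python sketch. The growth policy (start at 8 slots, double
whenever the table is more than 75% full) is an assumption picked for
the example, not what ht://Dig's hash table actually does; the 50,000
figure is the URL count from your message.

```python
def slots_needed(n_items, max_load=0.75, initial=8):
    """Slots allocated by a table that doubles when it exceeds max_load.

    The doubling policy and load factor are illustrative assumptions.
    """
    slots = initial
    while n_items > slots * max_load:
        slots *= 2
    return slots


urls = 50_000
slots = slots_needed(urls)
empty = slots - urls
print(f"{urls} URLs -> {slots} slots, {empty} empty "
      f"({100 * empty / slots:.0f}% of the table)")
```

Under those assumptions well over half the table is empty slots, and
that is before counting the URL strings themselves.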

Look, I can point to lots of places throughout the code where there
are really bad speed hits. The main developers right now are all
volunteers and we all have our hands full--if you'd like to help with
optimizations, please do!

-Geoff
