On Fri, 23 Oct 1998, Maciej Stachowiak wrote:
>
> This isn't directly on-topic, but I was going to suggest earlier that
> you change the header alist to a vector despite your readability
> concerns; it should be trivial to write a hashtab-header-alist or
> hashtab-properties in Scheme to translate it, and the more I think
> about it, the more convinced I am it would be a noticeable performance
> win.
>
yeah, I think you're right. I earlier replied that I like to do
(car myhastab) to see its stats, but I could just as easily do
(hashtab-stats myhashtab) and maybe throw in some real stats like the
mean/median bucket size, standard deviation, etc.
I think you're right about making the header a vector. I'm currently
calling scm_sloppy_assq 4 times upon insertion of a new key-value pair.
(5 times if the current bucket size exceeds the largest bucket size so
far).
> Regarding this proposal, basically nothing can save you from a hash
> function that hashes a lot of things to the exact same value; no
> number of resizes will help. The only hope for salvation here is for
> hashing to significantly change when the size of the range changes, so
> that resizing will actually solve such collisions.
I've found that the expected maximum bucket size seems to be related to
log(number-entries). I can't prove that, but I am currently modifying
max-bucket-size to grow at this rate and it does seems to work well.
> Another note: I use both eq? and equal? hashing in my code a fair bit.
> Also it would be nice if your code were able to hash by any of eq?,
> eqv? or equal? (the hashers for the first two should not be hard). At
> that point, I would suggest that a suitably cleaned up version of your
> code should replace Guile hash tables entirely, even 3% or 5%
> performance overhead is a small price to pay for auto-resizing.
>
well, all we have to do there is rewrite the basic procedures to
accept different eq predicates, then write a bunch of wrapper functions,
i.e.
hashtab_setqx(key, value)
would call
hashtab_basic_setx(key, value, scm_eq_p, my_hasherq)
of course, the hasher functions would all be quite different.
> Another reason that maybe not considering the ultimate hash range in
> the hash function may be a lose: the obvious way to implement `hashq'
> with no mod is to cast the SCM value as an int, this will never
> collide in a word-size space but almost certainly will in many common
> hashtable sizes.
My current tests use a big file filled with unique, and very large, random
numbers. But still, the bigger the hashtable, the bigger the expected
largest bucket size (relative to average bucket size).
Jay
jglascoe@jay.giss.nasa.gov