Feature Requests item #1057882, was opened at 2004-10-31 17:21
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=351355&aid=1057882&group_id=1355
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Sam Steingold (sds)
Assigned to: Bruno Haible (haible)
Summary: indicator (set) hash-tables
Initial Comment:
Sometimes a hash table is just a set,
i.e., not a general map, but a {0,1}-valued one.
I suggest a new hash table category, created via,
e.g., a :TYPE :SET argument to MAKE-HASH-TABLE,
where there is NO values vector and GETHASH returns
either T,T or NIL,NIL.
It should be printed as
#S(HASH-TABLE <TEST> list-or-values)
and the :INITIAL-CONTENT argument to MAKE-HASH-TABLE
should be a LIST (not an association list).
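Under this proposal, usage might look like the following sketch. This is purely hypothetical: :TYPE :SET and :INITIAL-CONTENT are the suggested extensions, not existing MAKE-HASH-TABLE arguments, and the printed form is only what the request asks for:

```lisp
;; Hypothetical sketch of the proposed API; :TYPE and :INITIAL-CONTENT
;; are the suggested extensions, not existing arguments.
(let ((s (make-hash-table :test 'equal :type :set
                          :initial-content '("a" "b" "c"))))
  (gethash "a" s)          ; => T, T     (member)
  (gethash "z" s)          ; => NIL, NIL (not a member)
  (setf (gethash "z" s) t) ; add "z" to the set
  s)  ; would print as #S(HASH-TABLE EQUAL "a" "b" "c" "z")
```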
----------------------------------------------------------------------

Concerning set-difference ...
If it is known that both sets are sorted, the calculation can proceed
very quickly indeed. Consider, apart from hashing, writing
#'sorted-set-diff, which presumes sorting, and then using it on sorted
sets. Intuitively, since decent sorting takes (* n (log n)) time
and #'sorted-set-diff is one-pass, this could be a favorable
solution for larger sets. It might compete favorably with hashing.
It would require, however, two predicates: a :test parameter and a
:precedes parameter, so you might wind up inferring #'< from #'= and so
forth. That could get sticky.
Just a thought.
Devious Dan

Bruno Haible wrote:
> Sam wrote:
>
>>is LENGTH always cheap enough?
>
> LENGTH, <= etc. are cheap.
NTH should be even cheaper for checks like this, where one is only
interested in whether there are at least n elements. (Although using NTH
like this is probably bad style.)
Martin

Bruno Haible writes:
> Hash-tables are certainly good for big lists. However, small lists (of size
> < 10) are the most frequent ones. Are you sure that your change didn't slow
> down the frequent case?
>
> I would recommend finding out the threshold beyond which the hash table
> solution is more efficient, and continuing to use the old list-based
> (and non-consing!) approach for the "small lists" case.
Even for the simple cases (eq/eql) there are interesting issues.
One is how to measure, another is when/where to measure.
I suggest not measuring at compile time, since this would
give different algorithms for different builds, and that's
a problem for such things as debugging. I'm not even sure
I'd want different algorithms on different architectures
unless they're radically different from any I know.
I suggest a simple model like this:
time to create hash table = m1 + m2 x size
time to add to hash table = m3
time to test membership in table = m4
time to test membership in list = m5 x size
where all of the m's are to be measured relatively
infrequently (less than once per clisp version), stored
in source files and treated as constants.
The list version cost of set-difference is then estimated
as (* (length x) (* m5 (length y)))
while the table version cost is estimated as
(+ m1 (* (+ m2 m3) (length y)) (* (length x) m4))
You might want to compute a few more constants that allow
you to avoid the computation of expected costs in at least
the smallest cases:
First, of course, if x or y is actually empty then you
know the answer.
Next
if m4 > (* m5 (length y)) ;; list vs last term of table
then clearly list is better. That can be tested by
(< (length y) #.(floor m4 m5)) ;; floor so as to compare ints
Similarly,
if (+ m2 m3) > (* m5 (length x)) ;; list vs 2nd term of table
which can be tested by
(< (length x) #.(floor (+ m2 m3) m5))
I figure in the cases where both x and y are non-empty the cost
is already at least (length x) + (length y) so you haven't lost
much by testing those two things.
In the case where neither is true I think it's worth your time
to evaluate the estimated cost formulae.
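For illustration, the whole decision procedure sketched above might be written like this. The numeric constants are placeholders standing in for the measured m1..m5 (real values would be benchmarked once per release and stored in the sources, as suggested), and USE-HASH-TABLE-P is a made-up name:

```lisp
;; Placeholder constants; real values would come from benchmarking.
(defconstant +m1+ 100)  ; table creation, fixed part
(defconstant +m2+ 2)    ; table creation, per entry
(defconstant +m3+ 3)    ; one table insertion
(defconstant +m4+ 2)    ; one table lookup
(defconstant +m5+ 1)    ; one list comparison

(defun use-hash-table-p (x y)
  "Heuristic: is the hash-table version of SET-DIFFERENCE cheaper?"
  (let ((lx (length x)) (ly (length y)))
    (and (plusp lx) (plusp ly)
         ;; cheap cut-offs from the message above
         (>= ly (floor +m4+ +m5+))
         (>= lx (floor (+ +m2+ +m3+) +m5+))
         ;; full comparison: estimated list cost vs. table cost
         (> (* lx +m5+ ly)
            (+ +m1+ (* (+ +m2+ +m3+) ly) (* lx +m4+))))))
```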
====
Things are much more complicated for equal and other more complex
tests, since the costs of the test and hash depend on the arguments
and the two algorithms do different numbers of those operations!
I can imagine all sorts of approaches, such as
- assume the cost of test and hash (m3, m4, m5) are very high
- assume the cost of test and hash are very low
- examine the data at run time to decide
- execute both in parallel
(so you're at least within a factor of 2 of the better one)
Note that it's even possible for the cost of hashing
to be very high while the cost of testing is very low
(if all of the objects to be compared are large but differ
in the first place you look).
I suggest in any case that you implement the two algorithms
as separate functions and make them both available in the
ext package so the user who knows which one he wants can
just call it by name.
Offhand I think I'd prefer the approach of assuming test and hash
are both expensive, but I expect there are lots of arguments that
I've not yet considered.
I hadn't actually intended for this to turn into a research project
but I'm afraid it's already too late.

Sam wrote:
> > You need to time one against the other.
>
> I remember you said that some platforms produce unreliable timings.
> Which ones are reliable?
A Linux without background jobs, on which you're the only user, and on which
you don't touch the mouse or keyboard while the task is running, is good.
(Yes, the run-time timings usually change by a few percent just from
clicking into or scrolling another window.)
Also, for all timings, use three runs, and ignore the first run.
> is LENGTH always cheap enough?
LENGTH, <= etc. are cheap.
> > No, its only purpose is to ease future maintenance of the code.
> then it defeats the purpose.
It guarantees that the two branches of the 'if' are in sync. Which reduces
by a factor of 2 the amount of code to proofread or understand.
Yes it looks funny, I know. Doesn't matter.
Bruno

Daniel Barlow wrote:
> That's not what strace sees: clisp is making repeated calls to read() and
> filling up its buffer a bit at a time.
Yes, this surprised me. I thought that kind of behaviour was
system dependent (BSD vs. SysV). But actually that's what POSIX specifies;
so that means even BSD Unices now must return partial reads for pipes,
ttys and sockets. (Do they? I haven't tested it.)
> > If you want the data to be returned immediately, use an unbuffered
> > socket.
>
> Surely that would also cause outgoing data to be unbuffered -
> i.e. repeated WRITE-CHAR calls translating to individual write()
> calls.
That's not the case in clisp. In clisp a bidirectional socket stream is
actually a two-way-stream; you can control both sides individually.
> Although this is certainly one possible behaviour for buffered
> streams, it's different from (and, I would argue, less useful than)
> the behaviour exhibited by C stdio buffers, or Perl buffers, or even
> SBCL/CMUCL stream buffers, where a call to the upper-level function to
> read a line may cause an attempt to read 4 or 8k but still returns as
> soon as there's a line to be read, saving the rest of the data for
> later.
I see what you mean. There are three ways to read from a file-descriptor
at POSIX level:
(a) read the whole buffer, hang if needed
(b) read a positive amount of bytes, hang if needed
(c) read all that's immediately available, return immediately
(a) is also known as gnulib full_read. (b) is known as gnulib safe_read.
read() behaves
like (a) on files,
like (b) on pipes, ttys, sockets,
like (c) on files, pipes, ttys, sockets when O_NONBLOCK has been set.
CLISP effectively offers:
(a) READ with :BUFFERED T
(b) READ with :BUFFERED NIL
(c) no READ, just LISTEN and READ-CHAR-NO-HANG
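For illustration, behaviour (c) is already expressible at the Lisp level with LISTEN and READ-CHAR-NO-HANG; a minimal sketch (READ-AVAILABLE is a made-up name, not a CLISP function):

```lisp
;; Sketch of (c): collect whatever input is immediately available on
;; STREAM, returning at once when nothing more is pending.  Uses only
;; standard LISTEN and READ-CHAR-NO-HANG.
(defun read-available (stream)
  (with-output-to-string (out)
    (loop for ch = (and (listen stream)
                        (read-char-no-hang stream nil nil))
          while ch
          do (write-char ch out))))
```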
What you want is the combination of (b) with :BUFFERED T. On files
it will be identical to (a), but on pipes, ttys, sockets it will make a
difference.
Let's call it :BUFFERED :INTERACTIVE.
I think I know how to implement this now...
Thanks for the suggestion.
Bruno

Sam wrote:
> suppose I have 2 lists: length 5 and length 1000.
> does it make sense to convert the 1000-long list to an HT?
Maybe. You need to time one against the other.
> what's better - 1000 searches of length 5 or 5 searches of length 1000?
5 searches of length 1000 is better. It's always better to let the inner
loop be the longer one (except when you're doing matrix operations and
have to respect locality of memory accesses; this is random or
unpredictable when dealing with lists, as here).
> any extra magic will probably slow down the super small cases (lists of
> only 3-5 members)
Not by much, if the test is simple.
> > (if ...
> >     (macrolet ((member-test ...))
> >       #1=
> >       (dolist (item list1)
> >         (unless (member-test item)
> >           (setq list1-filtered (cons item list1-filtered)))))
> >     (macrolet ((member-test ...))
> >       #1#))
>
> will this produce an identical win in compiled code size?!
No, its only purpose is to ease future maintenance of the code.
Bruno

> * Bruno Haible <oehab@...> [2004-10-29 18:40:18 +0200]:
>
> Sam wrote:
>> > It is fixed in CVS now.
>>
>> thanks a lot!!
>> now, may I dare to ask you to check this into the patched branch too
>
> No, this patch is not essential enough for the patch branch: UQ_to_I
> is used only by get-internal-*-time and by the file size information
> of DIRECTORY results.
OUCH!!! I do use the file size as reported by DIRECTORY!!!
(I don't expect 4GB files yet, though.)
> You can use
> (read-from-string (write-to-string (get-internal-run-time)))
> as a workaround.
get-internal-run-time is usually fast and non-consing (and it will
always be non-consing when you implement 48-bit fixnums :-).
This "workaround" is slow and it always conses.
I guess I will have to do it myself then.
--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
Experience always comes right after it would have been useful.

> * Bruno Haible <oehab@...> [2004-10-29 18:59:20 +0200]:
>
> Sam wrote:
>> Set functions now use hash-tables when possible.
>
> Hash-tables are certainly good for big lists. However, small lists (of
> size < 10) are the most frequent ones. Are you sure that your change
> didn't slow down the frequent case?
I am sure I did :-)
> I would recommend finding out the threshold beyond which the hash table
> solution is more efficient, and continuing to use the old list-based
> (and non-consing!) approach for the "small lists" case.
suppose I have 2 lists: length 5 and length 1000.
does it make sense to convert the 1000-long list to an HT?
what's better - 1000 searches of length 5 or 5 searches of length 1000?
any extra magic will probably slow down the super small cases (lists of
only 3-5 members)
>> I don't see how I can make the variable MEMBER? value inlined.
>> The alternative is to double the code size for each function...
>
> Doubling the code size doesn't matter here. You can avoid code
> duplication by doing
>
> (if ...
>     (macrolet ((member-test ...))
>       #1=
>       (dolist (item list1)
>         (unless (member-test item)
>           (setq list1-filtered (cons item list1-filtered)))))
>     (macrolet ((member-test ...))
>       #1#))
will this produce an identical win in compiled code size?!
--
Sam Steingold (http://www.podval.org/~sds) running w2k
<http://www.camera.org> <http://www.iris.org.il> <http://www.memri.org/>
<http://www.mideasttruth.com/> <http://www.honestreporting.com>
The only intuitive interface is the nipple. The rest has to be learned.

Sam wrote:
> Set functions now use hash-tables when possible.
Hash-tables are certainly good for big lists. However, small lists (of size
< 10) are the most frequent ones. Are you sure that your change didn't slow
down the frequent case?
I would recommend finding out the threshold beyond which the hash table
solution is more efficient, and continuing to use the old list-based
(and non-consing!) approach for the "small lists" case.
> I don't see how I can make the variable MEMBER? value inlined.
> The alternative is to double the code size for each function...
Doubling the code size doesn't matter here. You can avoid code duplication by
doing
(if ...
    (macrolet ((member-test ...))
      #1=
      (dolist (item list1)
        (unless (member-test item)
          (setq list1-filtered (cons item list1-filtered)))))
    (macrolet ((member-test ...))
      #1#))
Bruno

Sam wrote:
> > It is fixed in CVS now.
>
> thanks a lot!!
> now, may I dare to ask you to check this into the patched branch too
No, this patch is not essential enough for the patch branch: UQ_to_I
is used only by get-internal-*-time and by the file size information of
DIRECTORY results.
You can use
(read-from-string (write-to-string (get-internal-run-time)))
as a workaround.
Bruno
