Stress test failure with conditional-store

Description

(This assumes ccl::conditional-store should work even though it's not
officially supported.)

The RUN function below eventually hangs on Linux and Darwin. On 32-bit
it generally hangs sooner. The time until hanging decreases as the
thread count increases. The number of threads reported by
ccl:all-processes is relatively constant, and hanging still occurs
with a 2-second sleep added to the loop.

Unfortunately I could not reproduce with a CAS-based stack, and CAS
spin locks have been fixed (#1030), which leaves us with the more
complex CAS queue. I hope the implementation below is straightforward
enough. I'm confident that it is correct, so please double-check
before blaming the queue. I've included SBCL support for reference (it
runs on SBCL without a problem).

e.g., (NODE-CDR (QUEUE-TAIL queue)) is non-NIL, and ENQUEUE can't make any progress in that case. The TAIL of the queue in this case is the original NODE that was used to initialize the queue.

The only code in your example that sets the NODE-CDR of any node to DUMMY is the code in DEQUEUE that you seem to be saying isn't quite correct in some way that isn't significant. That code (as written) affects the queue's head and it's the tail's node-cdr that's affected; the head and the tail are of course the same node when the queue is empty. (Note also that DEQUEUE is checking to see if NEXT - the NODE-CDR of the queue's HEAD - to see if it's EQ to +DUMMY+; DEQUEUE is the only thing that sets a NODE's NODE-CDR to +DUMMY+, and nothing seems to set any NODE-CDR to NIL.)

I don't know that this isn't a CCL bug, but I have difficulty understanding what bug would cause a symbol to be stored in the NODE-CDR of a NODE instead of in the NODE-CAR, or in the wrong NODE structure. (The kinds of things that I can imagine going wrong tend to be less predictable/consistent than

Here's a possible scenario which would produce the queue instance you
gave. Suppose ENQUEUE and DEQUEUE are called concurrently. ENQUEUE
makes partial progress while DEQUEUE fully completes. Then for some
reason ENQUEUE doesn't complete.

For the sake of notation let's use conses instead of NODEs.

start: head = tail = (dummy . nil)

partial ENQUEUE: head = tail = (dummy . (:hello . nil)). Tail is

not updated yet.

complete DEQUEUE: head = (dummy . nil), tail = (dummy . dummy)

picking up where ENQUEUE left off, tail should be set to the

(:hello . nil) cons, which now looks like (dummy . nil). But ENQUEUE
doesn't set tail.

Replacing that SETF with a CONDITIONAL-STORE also seems to fix things. (Acquiring a lock also involves one or more "lock cmpxchg" instructions, and these instructions have the side-effect of imposing memory ordering.)

At the point where ENQUEUE does/did the SETF, only one thread should have had its CONDITIONAL-STORE succeed and should therefore be in a position to update the queue's tail. Using a lock or another conditional store isn't necessary to avoid contention, but it's entirely believable that using some form of memory barrier (to ensure that all threads/cores see the change in memory immediately) may very well be necessary to ensure that all threads have a consistent view of memory.

I tried making that new CONDITIONAL-STORE error if it returned NIL, and I made the other CONDITIONAL-STORE in ENQUEUE check to make sure that it didn't return NIL in a case where the store had succeeded. (The primitive that implements that case of CONDITIONAL-STORE can be interrupted if the GC suspends the thread and the code that handles that is complicated.) I didn't see either of those consistency checks fail;
at this point, I don't see a CCL bug here.

It seems to me that the single CAS in ENQUEUE provides necessary and
sufficient ordering.

After a thread writes to the tail slot, another thread may read the
old tail for some length of time. If so then that's fine -- CAS will
fail for some number of iterations, and no harm is done.

Therefore a write to the tail slot is always preceded by a valid read.
And a valid read must have been preceded by a completed write.
Therefore writes to the tail slot are ordered as a consequence of CAS
writes being ordered.

The claim that a memory barrier would solve this is not supported by
evidence or theory. If there is a fix that adds a barrier and keeps
the SETF then please show it. If you think CONDITIONAL-STORE is OK
then the problem might lie in SETF.

In write-barrier subprims that do an unconditional store (rplaca/rplacd,
set_hash_key, gvset) do the store as the first instruction.

In pc_luser_xp: do nothing if suspended at the first instruction of
these subprims. Otherwise, assume that the unconditional store has
already occurred and don't emulate it. (Other values may have been
stored in the same location since the thread was suspended.)