Compare and swap (casx): This instruction swaps the contents of one memory position allocated in the L2 data cache
with the value of a register. This means that this instruction always accesses a memory location in L2 cache.

How (on Solaris, possibly only in assembler?) do you allocate a piece of memory such that "this instruction always accesses a memory location in L2 cache"?

That is:

How do you allocate memory "in the L2 cache"?

There is no further explanation of this in the paper. They do however mention a pointer chasing arrangement to produce consistent L2 cache misses, which makes sense and suggests they know what they are talking about.
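For anyone unfamiliar with the pointer-chasing arrangement, here is a minimal sketch in C. It is not the paper's code: the names `build_cycle` and `chase` are made up for illustration, and to actually force misses you would have to assume an array much larger than the cache in question, with each element on its own cache line. The sketch only shows the dependency chain, where each load's address comes from the previous load so the hardware cannot overlap or prefetch them easily.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: fill next[0..n-1] (n >= 1) with one random cycle
   that visits every index exactly once (Sattolo's algorithm). */
static void build_cycle(size_t *next, size_t n)
{
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;      /* 0 <= j < i, never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Hypothetical helper: follow the chain for `steps` hops. Each load
   depends on the result of the previous one. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];
    return p;
}
```

Because the permutation is a single cycle, chasing `n` hops from any index returns to that index, and every element is touched once on the way round.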

How do you prevent it getting promoted to the L1 cache the first time it is accessed?

Their very purpose in using the instruction is to benefit from the low impact and high latency of an L1 cache miss. They further go on to say:

Its latency is 39 cycles in T1 and between 20 and 30 cycles in T2 (in our experiments it takes almost always about 30 cycles). This instruction does not excessively stress the processor structures that could be used by the active thread. In fact, casx only uses one entry of the shared LSU structure that connects the core to the interconnection network. Moreover, the memory space requirements of using this instruction are very low since all the spin-locks can access the same memory position.

Which makes it unlikely that the above is a slip of their tongues or otherwise a misinterpretation of their meaning.
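To make the idea concrete, here is a rough sketch in C11 of one reading of that passage: while waiting for a lock, the spinning thread issues a compare-and-swap on a shared scratch word purely as a long-latency delay. This is an interpretation, not the paper's code. The names `dummy_word`, `long_latency_delay`, `spin_lock` and `spin_unlock` are hypothetical, and whether a CAS behaves as a cheap delay on any given processor is an assumption that would need measuring.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical scratch word shared by every spinning thread; the quoted
   text suggests one memory position is enough for all spin-locks. */
static _Atomic uint64_t dummy_word;

/* Issue a CAS that swaps 0 for 0, so the word is never modified.
   It is used here only for its latency (assumption, see above). */
static void long_latency_delay(void)
{
    uint64_t expected = 0;
    atomic_compare_exchange_strong(&dummy_word, &expected, 0);
}

/* Acquire: try to change the lock word from 0 (free) to 1 (held). */
static void spin_lock(_Atomic int *lock)
{
    int expected = 0;
    while (!atomic_compare_exchange_weak(lock, &expected, 1)) {
        expected = 0;            /* a failed CAS overwrote expected */
        long_latency_delay();    /* back off before the next attempt */
    }
}

/* Release: mark the lock as free again. */
static void spin_unlock(_Atomic int *lock)
{
    atomic_store(lock, 0);
}
```

The delay call is where the SPARC version would put its casx; everything else is an ordinary CAS-based spin lock.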

I'm trying to work out how to apply their work on an Intel processor. The Perl link is another attempt at trying to make efficient shared memory available from Perl.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

This is less Solaris and more (Ultra)SPARC architecture, and that paper discusses some "OS" I've never heard of, "NetraDPS". I found some UltraSparc CPU documentation, but it is somewhat vague on the memory accesses issued by a CASX (or CASXA) instruction.

My vague interpretation of D.2.5.3 is that a CASX instruction will fetch the appropriate page into L2 cache if it is not already there, and then perform the exchange only in the L2 cache, and not force an immediate write-back to the main RAM. This somewhat matches the error behaviour from 16.9.1.7, where a conflict between ECC corrections and CASX instructions on writeback may occur.

My vague interpretation of D.2.5.3 is that a CASX instruction will fetch the appropriate page into L2 cache if it is not already there, and then perform the exchange only in the L2 cache, and not force an immediate write-back to the main RAM.

Thank you for the link and your interpretation. It still took a while for the penny to drop, but it has now.

On the T1 & T2, L1 (instruction & data) caches are per core. The L2 cache is shared between all cores. Compare & swap instructions are specifically designed for inter-thread & inter-core signalling; they therefore have to be coherent at the L2 cache.

The L2 cache coherency requirement is what causes their high latency; that they only affect a single L2 cache line explains their low impact. Together these make them perfect for the task of reducing spin-lock 'burn'.

Now to seek out the equivalent X64 instruction :)
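As a starting point, here is a minimal sketch of a 64-bit compare-and-swap written with C11 atomics rather than hand-written assembler. The helper name `cas64` is made up for illustration. On x86-64, GCC and Clang typically compile `atomic_compare_exchange_strong` on a 64-bit word to a `lock cmpxchg` instruction, but that is compiler behaviour worth confirming in the generated assembly, and nothing here says its latency or cache behaviour matches casx on the T1/T2.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: if *addr == expected, store desired into *addr.
   Always returns the value that was found in memory, which is roughly
   how casx reports its result in the destination register. */
static uint64_t cas64(_Atomic uint64_t *addr, uint64_t expected,
                      uint64_t desired)
{
    /* On failure the call writes the observed value into expected;
       on success expected already equals the old value. */
    atomic_compare_exchange_strong(addr, &expected, desired);
    return expected;
}
```

A caller can tell whether the swap happened by comparing the return value with the `expected` it passed in.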

L2 cache is managed by the hardware inside the CPU - everything goes through here, so you don't (can't) "allocate" in it. However, if you take care to keep your code efficiently small, then the L2 cache will end up containing your entire codeset, and thus run faster.

It looks like the casx instruction is giving ASM programmers some means to directly use the L2 cache, outside of the normal CPU hardware cache management.

If "casx" isn't an Intel instruction, then forget about using it on that processor!

None of this seems relevant to perl shared memory though - where does perl come into this???

The question was purely about the particular semantics of the Solaris version as that was used by the researchers of the paper I was reading. It was important for me to understand those semantics so that I could work out whether the x64 equivalents were compatible with their algorithms. They are (kinda).

None of this seems relevant to perl shared memory though - where does perl come into this???

It is. Or rather it will be soon.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
