Sunday, February 9, 2014

Rewrites of the STM core model -- again

Hi all,

A quick note about the Software Transactional Memory (STM) front.

Since the previous
post, we believe we progressed a lot by discovering an alternative
core model for software transactions. Why do I say "believe"? It's
because it means again that we have to rewrite from scratch the C
library handling STM. This is currently work in progress. Once this is
done, we should be able to adapt the existing pypy-stm to run on top of
it without much rewriting effort; in fact it should simplify the
difficult issues we ran into for the JIT. So while this is basically
yet another restart similar to last
June's, the difference is that the work that we have already put in the PyPy
part (as opposed to the C library) remains.

You can read about the basic ideas of this new C library here.
It is still STM-only, not HTM, but because it doesn't constantly move
objects around in memory, it would be easier to adapt an HTM version.
There are even potential ideas about a hybrid TM, like using HTM but
only to speed up the commits. It is based on a Linux-only system call, remap_file_pages()
(poll: who has heard of it before? :-). As previously, the work is done
by Remi Meier and myself.

Currently, the C library is incomplete, but early experiments show good
results in running duhton,
the interpreter for a minimal language created for the purpose of
testing STM. By "good results" I mean that we brought down the slow-down from
60-80% (previous version) to around 15% (current version). This number
measures the slow-down from the non-STM-enabled to the STM-enabled
version, on one CPU core; of course, the idea is that the STM version
scales up when using more than one core.

This means that we are looking forward to a result that is much better
than originally predicted. The pypy-stm has a chance to run at a
one-thread speed that is only "n%" slower than the regular pypy-jit, for
a value of "n" that is optimistically 15 --- but more likely some number
around 25 or 50. This is seriously better than the original estimate,
which was "between 2x and 5x". It would mean that using pypy-stm is
quite worthwhile even with just two cores.

More updates later...

Armin

I was wondering how to use this call when I learnt of it, but couldn't figure anything out except possibly database applications (similar) and sort algorithms (too limited). I think this call may be used when manipulating framebuffers too; there was something about having multiple mappings [to hardware], some read-only, some not.

I would like to [possibly] disagree with your statement in c7 README "Most probably, this comes with no overhead once the change is done..."

TLB cache is a limited resource and may easily be contended on large systems. Regular mmap could [in theory] use huge TLB pages, remapped individual pages cannot.

In addition there is a small penalty during first access to the remapped page, though you may consider it amortized depending on remap/reuse ratio.

Granted it's still small stuff.

Reserving one register is a cool trick, and one I find quite acceptable. It too has a small penalty, but the benefits surely outweigh it!

@Dina: Thanks for the feedback! Note that "%gs" is a special register that is usually not used: there is no direct way to read/write its actual value. It needs to be done with a syscall, at least before very recent CPUs. It can only be used in addressing instructions as an additional offset.