Re: indirect threading for bytecode interpreter

From: Stefan Monnier
Subject: Re: indirect threading for bytecode interpreter
Date: Thu, 17 Sep 2009 20:59:38 -0400
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (gnu/linux)

Helmut> 5% doesn't sound like a lot to some people.
> Shrug. Obviously I think the tradeoff is worth it, or I would not have
> sent the patch. I don't think the result is all that ugly. And,
> importantly, it is very low-hanging fruit.
I agree it doesn't seem that ugly. Looking at
http://lists.gnu.org/archive/html/emacs-devel/2004-05/txt1OKi7Cs5BI.txt
again, I like his use of
# define OPLABL(X) [X] = &&lbl_ ##X
to initialize the table, making sure that it's initialized correctly
(no need for your sanity checks).
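
A minimal sketch of how such a macro can build the dispatch table, assuming GCC's labels-as-values extension and C99 designated initializers. The opcodes (OP_PUSH1, OP_ADD, OP_HALT) and the function run() are made up for illustration; they are not taken from the patch or from bytecode.c.

```c
#include <assert.h>

/* Hypothetical opcodes, for illustration only.  */
enum { OP_PUSH1, OP_ADD, OP_HALT, OP_COUNT };

/* Each slot is indexed by the opcode itself, so the order of the
   entries cannot get out of sync with the enum.  */
#define OPLABL(X) [X] = &&lbl_##X

static int
run (const unsigned char *pc)
{
  static const void *const targets[OP_COUNT] = {
    OPLABL (OP_PUSH1), OPLABL (OP_ADD), OPLABL (OP_HALT)
  };
  int stack[16];   /* no overflow check: this is only a sketch */
  int sp = 0;

#define NEXT() goto *targets[*pc++]   /* GNU C computed goto */
  NEXT ();

 lbl_OP_PUSH1:
  stack[sp++] = 1;
  NEXT ();
 lbl_OP_ADD:
  assert (sp >= 2);
  sp--;
  stack[sp - 1] += stack[sp];
  NEXT ();
 lbl_OP_HALT:
  return sp > 0 ? stack[sp - 1] : 0;
#undef NEXT
}
```

Because every entry is written as [opcode] = address, adding or reordering opcodes cannot silently shift the table. A forgotten entry is left as a null pointer rather than pointing at the wrong handler.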
Helmut> vmgen sounds like a good idea, but I fear that it makes the build
Helmut> process quite a bit more complicated.
> You can check in the generated code.
IIUC the code it generates may depend on the platform (tho, it can
probably output platform-independent code as well, I guess).
> vmgen is a nice idea.
Yes, and it could bring yet more optimization tricks, for free.
> I rejected writing this as a direct-threaded interpreter because
> I assumed that the added memory use would be a bad tradeoff. But, if
> you are interested in that, perhaps I could take a stab at it.
I think a direct-threaded interpreter would take a bit more work,
because you need to replace the bytecode with "word-code". I don't know
how much of an impact it would have on memory use.
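
To make the "word-code" idea concrete, here is a rough sketch assuming GCC's labels-as-values extension. The names engine(), translate() and run_bytecode(), the opcodes, and the 64-instruction limit are all hypothetical. The bytecode is translated once into an array of handler addresses, and the interpreter then jumps through that array without any table lookup.

```c
#include <assert.h>
#include <stddef.h>

enum { OP_INC, OP_DBL, OP_HALT, OP_COUNT };   /* hypothetical opcodes */
enum { MAX_CODE = 64 };                       /* arbitrary sketch limit */

/* With CODE == NULL, return the table of handler addresses (they are
   only visible inside this function).  Otherwise execute the
   word-code and store the accumulator in *RESULT.  */
static const void *const *
engine (const void *const *code, int *result)
{
  static const void *const targets[OP_COUNT] = {
    [OP_INC] = &&do_inc, [OP_DBL] = &&do_dbl, [OP_HALT] = &&do_halt
  };
  int acc = 0;

  if (code == NULL)
    return targets;

#define NEXT() goto **code++   /* the code word *is* the jump target */
  NEXT ();
 do_inc:  acc += 1; NEXT ();
 do_dbl:  acc *= 2; NEXT ();
 do_halt: *result = acc; return NULL;
#undef NEXT
}

/* Replace each one-byte opcode with the address of its handler.  */
static void
translate (const unsigned char *bytecode, size_t n, const void **wordcode)
{
  const void *const *targets = engine (NULL, NULL);
  for (size_t i = 0; i < n; i++)
    wordcode[i] = targets[bytecode[i]];
}

static int
run_bytecode (const unsigned char *bytecode, size_t n)
{
  const void *wordcode[MAX_CODE];
  int result = 0;

  assert (n <= MAX_CODE);
  translate (bytecode, n, wordcode);
  engine (wordcode, &result);
  return result;
}
```

In this sketch each one-byte opcode grows to sizeof (void *) bytes, typically 8 on a 64-bit platform. A real interpreter would also have to widen or re-encode inline operands and jump offsets, which this toy instruction set does not have.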
Helmut> I'm wondering why gcc can't perform this transformation from the
Helmut> switch based code. Is there no compiler setting to skip the
Helmut> range check in the switch statement?
> It isn't about range checking but about eliminating a jump during the
> dispatch.
Actually, IIRC Anton Ertl (vmgen's author) has some articles which
indicate that a big part of the win isn't just the removal of some
instructions, but more importantly the multiplication of "jump to next
target": instead of having only 1 computed jump to the next byte-code
target (plus N jumps back to the starting point), you have N computed
jumps, so each one can be predicted independently. The single computed
jump in gcc's output code is terribly difficult for the CPU to predict,
leading to lots and lots of cycles wasted due to mispredictions.
The N computed jumps aren't very easy to predict either, but some of
them at least are a bit easier, because some sequences of byte-code are
more common than others, so the CPU's jump prediction can fail a bit
less often, leading to fewer wasted cycles.
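
The difference between one shared dispatch jump and N separate ones can be sketched as follows, assuming GCC's labels-as-values extension. The opcodes and function names are made up, and with only three opcodes the compiler may well not emit a jump table for the switch, so this shows the shape of the code rather than a benchmark.

```c
enum { OP_INC, OP_DEC, OP_HALT, OP_COUNT };   /* hypothetical opcodes */

/* Switch-based: every handler jumps back to the top of the loop, and
   the one indirect jump generated for the switch has to dispatch
   every opcode.  */
static int
run_switch (const unsigned char *pc)
{
  int acc = 0;
  for (;;)
    switch (*pc++)
      {
      case OP_INC: acc++; break;
      case OP_DEC: acc--; break;
      case OP_HALT: return acc;
      default: return -1;
      }
}

/* Threaded: each handler ends with its own indirect jump, so the CPU
   can keep separate prediction state for "what follows INC" and
   "what follows DEC".  No check for invalid opcodes: sketch only.  */
static int
run_threaded (const unsigned char *pc)
{
  static const void *const targets[OP_COUNT] = {
    [OP_INC] = &&do_inc, [OP_DEC] = &&do_dec, [OP_HALT] = &&do_halt
  };
  int acc = 0;

  goto *targets[*pc++];
 do_inc:  acc++; goto *targets[*pc++];
 do_dec:  acc--; goto *targets[*pc++];
 do_halt: return acc;
}
```

Both functions execute the same handler bodies for a given program. Only the number and placement of the indirect jumps differ.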
Anton had some experiments where he duplicated some byte-codes, and
showed that it could also improve performance (again, by making the
jumps more predictable: the actual executed instructions were exactly
identical).
> GCC could be taught to do this. I imagine that it has always been
> simpler for people to just update their interpreter than it has been to
> try to fix GCC.
IIRC, some people experimented with teaching gcc to do this kind of
transformation (copying the initial jump to the end of each block), but it was
difficult for gcc to tell automatically when it was a good idea and when
it wasn't. After all, by duplicating this code, you increase the code
size, and if each branch's prediction is pretty much identical, you
might be better off with a single jump so the prediction data from one
branch helps the other branches as well (and so it doesn't use up as
much space in the jump prediction table).
For interpreters it's almost always a good thing to do, because a lot of
execution time will be spent in this loop+switch, but in general it's
not that clear cut.
> I don't think that some possible future GCC change should affect whether
> this patch goes in.
No, indeed.
Stefan