sbcl-devel

Hi, long time lurker, first time poster.
I was poking around the Python back end (!!) to see where a peephole
optimizer could go. As occasional comments in the mail archives have
noted, there seems to be no place in the compiler where the full set of a
function's assembly instructions is collected together. The only
apparent way to add a peephole optimizer to the assembler is either a
massive rewrite of the way machine instructions are emitted by vops, or
some bad kludge that tries to capture the assembly instructions as the
vops are processed (and keeps them synchronized with the backpatches and
fixups).
Is there another option, such as optimizing the code completely
separately from the compiler? For example: take a compiled function, use
the disassembler to re-create the assembly code, peephole-process the
assembly code, and make any resulting changes to the machine code
(either in parallel or by queuing up the changes). You would also need
to update the labels, offsets, and any padding in the machine code.
Finally, bind the new machine code to the proper slot of the symbol.
Another approach might be to read an existing fasl file in a streaming
fashion, and optimize function code before writing it out to a new fasl
file.
My question is: do any of these alternatives make sense? In some ways
it would be a nice tool, separate from the compiler, that would allow
experimenting with and benchmarking certain code optimizations (of the
peephole variety, not the stuff Python already excels at) to find
low-hanging fruit.
DDL

On Tue, Apr 11, 2006 at 09:55:45PM -0400, DDL wrote:
> I was poking around the Python back end (!!) to see where a peephole
> optimizer could go. As occasional comments in the mail archives have
> noted, there seems to be no place in the compiler where the full set of
> a function's assembly instructions is collected together. The only
> apparent way to add a peephole optimizer to the assembler is either
> a massive rewrite of the way machine instructions are emitted by vops,
> or some bad kludge that tries to capture the assembly instructions
> as the vops are processed (and keeps them synchronized with the
> backpatches and fixups).
>
> My question is: do any of these alternatives make sense? In some
> ways it would be a nice tool, separate from the compiler, that would
> allow experimenting with and benchmarking certain code optimizations
> (of the peephole variety, not the stuff Python already excels at) to
> find low-hanging fruit.
Another alternative, which I think would be cleaner than your two
proposals, would be to peephole-optimize SBCL's IR2 representation, at
least after register allocation is done. At the very least, this would
catch a consequence of the fact that on x86, VOPs (at least insofar as
IR2 translation goes) cannot express updates to memory locations. So if
you have a loop index stored in memory, you get the code:
    mov %register, [%ebp+...]
    add %register, 4
    mov [%ebp+...], %register

when a simple

    add [%ebp+...], 4
would have sufficed. This can be caught easily in IR2 and handled in a
clean fashion.
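To make the shape of such a pass concrete, here is a minimal sketch of
that load/op/store rule, operating on a made-up symbolic instruction
list rather than on real IR2 structures. None of the names below come
from SBCL, and a real pass would have to consult liveness information
before assuming the register is dead afterwards:

    (defun fold-load-add-store (code)
      ;; CODE is a list of (OP DST SRC) triples, destination first,
      ;; e.g. (MOV :EAX (:EBP -28)).  Rewrites the window
      ;;   MOV reg, mem / ADD reg, imm / MOV mem, reg
      ;; into ADD mem, imm, assuming REG has no further uses.
      (if (null (cddr code))
          code
          (destructuring-bind (a b c &rest rest) code
            (if (and (eq (first a) 'mov) (listp (third a))     ; MOV reg, mem
                     (eq (first b) 'add) (integerp (third b))  ; ADD reg, imm
                     (eq (second b) (second a))                ; same register
                     (eq (first c) 'mov)                       ; MOV mem, reg
                     (equal (second c) (third a))              ; same memory slot
                     (eq (third c) (second a)))
                (cons (list 'add (third a) (third b))          ; ADD mem, imm
                      (fold-load-add-store rest))
                (cons a (fold-load-add-store (cdr code)))))))

    ;; (fold-load-add-store
    ;;  '((mov :eax (:ebp -28)) (add :eax 4) (mov (:ebp -28) :eax)))
    ;; => ((ADD (:EBP -28) 4))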
Peephole optimization on IR2 prior to register allocation might also
help, although I think some form of redundancy elimination, such as
value numbering, on IR2 prior to register allocation would be more
helpful. Doing this sort of thing on IR2 is also closer in spirit
to CMUCL's original goal of doing as much as possible in
machine-independent Lisp, rather than cluttering up the backends with
lots of architecture-specific code. Peephole optimization is, of course,
somewhat architecture-specific, but by doing it on IR2, perhaps
some common patterns can be caught *and* you don't have to deal with the
intricacies of each machine's instruction set.
--
Nathan | From Man's effeminate slackness it begins. --Paradise Lost
The last good thing written in C was Franz Schubert's Symphony Number 9.
--Erwin Dieterich

Nathan Froyd <froydnj@...> writes:
> Another alternative, which I think would be cleaner than your two
> proposals, would be to peephole-optimize SBCL's IR2 representation, at
> least after register allocation is done.
I've been experimenting with this a bit.
First off, there seem to be patterns that would be a lot harder to express
in a regular peephole optimizer than in one operating on IR2. For example:

    MOVE tn1 -> tn2
    MYSTERY-VOP tn2

where MYSTERY-VOP's argument is the only use of tn2. Provided a bunch of
other constraints are satisfied, this should be able to turn into:

    MYSTERY-VOP tn1
Second of all, I was wondering why you say _after_ register allocation?
It seems to me that since we're operating on IR, we should be able to
do the peephole magic with just the TNs, and the register allocator
should be able to do a better job if we manage to nuke a few of them
by peepholing.
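To make that pattern concrete, here is a toy model of the MOVE
propagation, operating on lists like (MOVE TN1 TN2) (source before
destination) with plain symbols for TNs. It ignores all the SC and
lifetime constraints a real IR2 pass would need, and none of these names
exist in SBCL:

    (defun propagate-single-use-moves (vops)
      ;; VOPS is a list like ((MOVE TN1 TN2) (MYSTERY-VOP TN2)).
      ;; Assumes each TN is written at most once; a real pass would
      ;; check defs as well as uses.
      (let ((reads (make-hash-table)))
        ;; count how many times each TN is read
        (dolist (vop vops)
          (if (eq (first vop) 'move)
              (incf (gethash (second vop) reads 0)) ; MOVE reads its source
              (dolist (tn (rest vop))
                (incf (gethash tn reads 0)))))
        (let ((replacement (make-hash-table))
              (result '()))
          (dolist (vop vops (nreverse result))
            (cond ((and (eq (first vop) 'move)
                        (eql 1 (gethash (third vop) reads 0)))
                   ;; destination is read exactly once: substitute the
                   ;; source at that use and drop the MOVE, chasing
                   ;; earlier replacements so chains of MOVEs collapse
                   (let ((source (second vop)))
                     (setf (gethash (third vop) replacement)
                           (gethash source replacement source))))
                  (t
                   (push (cons (first vop)
                               (mapcar (lambda (tn)
                                         (gethash tn replacement tn))
                                       (rest vop)))
                         result)))))))

    ;; (propagate-single-use-moves '((move tn1 tn2) (mystery-vop tn2)))
    ;; => ((MYSTERY-VOP TN1))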
Cheers,
-- Nikodemus Schemer: "Buddha is small, clean, and serious."
Lispnik: "Buddha is big, has hairy armpits, and laughs."

On Mon, Dec 04, 2006 at 11:26:39AM +0200, Nikodemus Siivola wrote:
> Nathan Froyd <froydnj@...> writes:
>
> > Another alternative, which I think would be cleaner than your two
> > proposals, would be to peephole-optimize SBCL's IR2 representation,
> > at least after register allocation is done.
>
> I've been experimenting with this a bit.
Cool! I have experimented with this a bit also, see below.
> First off, there seem to be patterns that would be a lot harder to express
> in a regular peephole optimizer than in one operating on IR2. For example:
>
>     MOVE tn1 -> tn2
>     MYSTERY-VOP tn2
>
> where MYSTERY-VOP's argument is the only use of tn2. Provided a bunch of
> other constraints are satisfied, this should be able to turn into:
>
>     MYSTERY-VOP tn1
This example would seem to be trivial enough to accomplish, but when you
add in SC restrictions (e.g. OPTIMIZATIONS #36), things are not as easy
as they seem.
> Second of all, I was wondering why you say _after_ register allocation?
Doing peephole optimization before and after register allocation would
clearly be beneficial. I said after register allocation because that's
when stack operands become visible. On IRC last week, a bit of code
from inside Ironclad came up:
    mov [ebp-28], eax
    shl [ebp-28], 2
    and [ebp-28], 1020
    mov eax, [ebp-28]
which should of course be 'shl eax, 2 / and eax, 1020'. A smarter
register allocator should be able to produce the proper code. But I
understand peephole optimization and I do not understand the register
allocator. :)
As for experimenting with peephole optimization, a while back I wrote this
IR2 pass for recognizing when we had already loaded a constant in a
given basic block. So the pass turns:
    mov eax, NIL
    mov ebx, NIL
into:
    mov eax, NIL
    mov ebx, eax
This sort of code turns up in keyword argument parsing (although the
effectiveness of the transformation is by no means limited to that
scenario!).
The pass:
(in-package :sb-c)

(defun commonify-move-vops (component)
  (do-ir2-blocks (block component)
    (format *trace-output* "Commonifying moves for block ~A, IR1 block ~A~%~
                            block-info ~A~%"
            block (ir2-block-block block)
            (block-info (ir2-block-block block)))
    (commonify-move-vops-in-block block)))

(defun commonify-move-vops-in-block (ir2block)
  (let ((constant-to-tn-table (make-hash-table))
        (tn-to-constant-table (make-hash-table)))
    (declare (ignorable tn-to-constant-table))
    (do ((vop (ir2-block-start-vop ir2block) (vop-next vop)))
        ((null vop))
      (let ((vop-name (vop-info-name (vop-info vop))))
        (case vop-name
          (move
           (let* ((tn-ref (vop-args vop))
                  (source (tn-ref-tn tn-ref))
                  (leaf (tn-leaf source)))
             (when (and leaf (typep leaf 'constant))
               (let ((value (constant-value leaf)))
                 (format *trace-output*
                         "Checking for duplicate for ~A~%" value)
                 (multiple-value-bind (tn foundp)
                     (gethash value constant-to-tn-table)
                   (cond
                     (foundp
                      ;; we already have this constant in a TN; make the
                      ;; MOVE read from that TN instead of reloading it
                      (format *trace-output*
                              "Duplicate load of ~A into ~A, resetting to ~A in ~A~%"
                              value (tn-ref-tn (vop-results vop)) tn vop)
                      (setf (tn-ref-tn tn-ref) tn))
                     (t
                      ;; first load of this constant in the block;
                      ;; remember which TN it lands in
                      (let ((result (tn-ref-tn (vop-results vop))))
                        (setf (gethash value constant-to-tn-table) result)))))))))
          ((list list* call-named)
           ;; KLUDGE: these VOPs can clobber registers, so forget what
           ;; we know about already-loaded constants
           (clrhash constant-to-tn-table)))))))
It worked pretty well, although building SBCL with this pass failed.
One of the problems is that IR2 was not really designed to be optimized;
e.g. I think the lifetime pass threads the uses of TNs together, and
this makes it difficult to insert other uses in later passes. (The
reason that SBCL failed to build is related to this: the pass as written
does no updating of lifetime information or def-use chains.)
Another problem is that basic blocks are split to satisfy local TN limits,
which limits the scope of optimization on single basic blocks. (Trying
to do optimizations *before* basic-block splitting causes its own
set of problems, IIRC.)
Some slight modifications of IR2 are probably necessary if we are to
get really serious about doing optimizations on IR2.
--
Nathan | From Man's effeminate slackness it begins. --Paradise Lost
The last good thing written in C was Franz Schubert's Symphony Number 9.
--Erwin Dieterich

Nathan Froyd <froydnj@...> writes:
> This example would seem to be trivial enough to accomplish, but when you
> add in SC restrictions (e.g. OPTIMIZATIONS #36), things are not as easy
> as they seem.
Yep. I've been unable to express the restrictions in a way that would
not break the build in the general case and would still do something
useful.
> As for experimenting with peephole optimization, a while back I wrote this
> IR2 pass for recognizing when we had already loaded a constant in a
> given basic block. So the pass turns:
>
>     mov eax, NIL
>     mov ebx, NIL
>
> into:
>
>     mov eax, NIL
>     mov ebx, eax
>
> This sort of code turns up in keyword argument parsing (although the
> effectiveness of the transformation is by no means limited to that
> scenario!).
Nifty!
> (defun commonify-move-vops-in-block (ir2block)
>   (let ((constant-to-tn-table (make-hash-table))
>         (tn-to-constant-table (make-hash-table)))
>     (declare (ignorable tn-to-constant-table))
>     (do ((vop (ir2-block-start-vop ir2block) (vop-next vop)))
>         ((null vop))
>       (let ((vop-name (vop-info-name (vop-info vop))))
>         (case vop-name
>           (move
>            (let* ((tn-ref (vop-args vop))
>                   (source (tn-ref-tn tn-ref))
>                   (leaf (tn-leaf source)))
>              (when (and leaf (typep leaf 'constant))
>                (let ((value (constant-value leaf)))
>                  (format *trace-output*
>                          "Checking for duplicate for ~A~%" value)
>                  (multiple-value-bind (tn foundp)
>                      (gethash value constant-to-tn-table)
>                    (cond
>                      (foundp
>                       (format *trace-output*
>                               "Duplicate load of ~A into ~A, resetting to ~A in ~A~%"
>                               value (tn-ref-tn (vop-results vop)) tn vop)
>                       (setf (tn-ref-tn tn-ref) tn))

The SETF here should probably be change-tn-ref-tn, no?

>                      (t
>                       (let ((result (tn-ref-tn (vop-results vop))))
>                         (setf (gethash value constant-to-tn-table) result)))))))))
>           ((list list* call-named)
>            ;; KLUDGE
>            (clrhash constant-to-tn-table)))))))
>
> It worked pretty well, although building SBCL with this pass failed.
> One of the problems is that IR2 was not really designed to be optimized;
> e.g. I think the lifetime pass threads the uses of TNs together and
> this makes it difficult to insert other uses in later passes. (The
> reason that SBCL failed to build is related to this: I do no updating
> of lifetime information and/or def-use chains in my pass as written.)
> Another problem is that basic blocks are split to satisfy local TN limits,
> which limits the scope of optimization on single basic blocks. (Trying
> to do optimizations *before* basic block splitting causes its own
> set of problems, IIRC.)
>
> Some slight modifications of IR2 are probably necessary if we are to
> get really serious about starting to do optimizations on IR2.
*cough* SSA / SSI *cough*
;-)
One thing that annoys me out of all proportion is the way we get one
VERIFY-ARG-COUNT per entry point, which ends up injecting multiple
identical error-trap sequences for every function. It doesn't really
matter all that much, since a trap sequence is only 5 bytes, but...
Cheers,
-- Nikodemus Schemer: "Buddha is small, clean, and serious."
Lispnik: "Buddha is big, has hairy armpits, and laughs."

On Mon, Dec 04, 2006 at 05:00:56PM +0200, Nikodemus Siivola wrote:
> > (setf (tn-ref-tn tn-ref) tn))
>
> The SETF here should probably be change-tn-ref-tn, no?
Ah, it probably should. Very handy!
> One thing that annoys me out of all proportion is the way we get one
> VERIFY-ARG-COUNT per entry point, which ends up injecting multiple
> identical error-trap sequences for every function. It doesn't really
> matter all that much, since a trap sequence is only 5 bytes, but...
One way around this would be the strategy that Allegro (IIUC) employs:
there is a common trampoline for funcalling, which unifies call counting,
function tracing, etc. in one location. I think Duane Rettig also
claimed this scheme is easier on the processor's I-cache (less duplicated
code) and only epsilon slower than SBCL's approach (since the trampoline
is always in cache). I think it would shave ~150-200k off an x86 core.
(You could also move some of the stack manipulation from a typical
calling sequence into this trampoline.)
Another possibility comes from OpenMCL. If I understand OpenMCL's source
correctly, certain codes after a trap instruction denote particular
frequent traps; encoding them this way saves space compared to expanding
them out. So, instead of SBCL's current:
    ; 4FC8:  CC0A   BREAK 10    ; error trap
    ; 4FCA:  02     BYTE #X02
    ; 4FCB:  18     BYTE #X18   ; INVALID-ARG-COUNT-ERROR
    ; 4FCC:  4D     BYTE #X4D   ; ECX

we might have instead:

    ; 4FC8:  CC9A   BREAK 154   ; INVALID-ARG-COUNT-ERROR, ECX
which is three bytes shorter. The most common errors in SBCL are
1. OBJECT-NOT-TYPE
2. UNBOUND-SYMBOL (!)
3. INVALID-ARG-COUNT
4. OBJECT-NOT-LIST
5. LAYOUT-INVALID (I think)
Christophe graciously ran my script for calculating error frequency on
his lisp-world.core and found that the top four, at least, stay the
same. I think there's just enough space in one byte that these four
errors could be encoded in the byte directly after the BREAK and still
have space left over for the current trap codes. (It's also worth
noting that doing this would save <1% of the space of even Christophe's
lisp-world.core...every little bit counts, I guess.)
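For concreteness, a packing along those lines might look like the sketch
below. The specific byte assignments are invented for illustration (they
are not OpenMCL's encoding, nor an actual SBCL proposal), but they show
that four errors crossed with eight registers fit comfortably above the
existing trap codes:

    (defconstant +combined-trap-base+ 128)

    (defparameter *frequent-errors*
      #(object-not-type unbound-symbol invalid-arg-count object-not-list))

    (defparameter *x86-registers*
      #(eax ecx edx ebx esp ebp esi edi))

    (defun encode-combined-trap (error register)
      ;; pack a frequent error and its register operand into a single
      ;; trap byte; bytes below the base keep their current meanings
      (+ +combined-trap-base+
         (* 8 (position error *frequent-errors*))
         (position register *x86-registers*)))

    (defun decode-combined-trap (byte)
      ;; inverse of ENCODE-COMBINED-TRAP for bytes >= the base
      (let ((code (- byte +combined-trap-base+)))
        (values (aref *frequent-errors* (floor code 8))
                (aref *x86-registers* (mod code 8)))))

Four errors times eight registers needs only codes 128-159, leaving
everything below 128 for the trap codes we already have.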
--
Nathan | From Man's effeminate slackness it begins. --Paradise Lost
The last good thing written in C was Franz Schubert's Symphony Number 9.
--Erwin Dieterich