sbcl-devel

Hi,
The change below breaks OpenBSD/PPC. It does not break Darwin/PPC, nor OpenBSD/x86 or OpenBSD/AMD64.
I've attached a log from a broken build. I've also attached the world's silliest patch since all it
does is remove part of the commit below.
If someone has an idea I'm happy to debug and explore more. I don't understand the bit of code
that is broken and haven't gotten very far to understand why it is broken.
Thanks
bruce
commit 52b1041d3a14eaa4e45f6d8edfbdc0dec4292239
Author: Christophe Rhodes <csr21@...>
Date: Thu Apr 5 19:55:05 2012 +0100
Fix bug in unsigned modular arithmetic using a signed implementation
If we aim to be clever by implementing an unsigned modular arithmetic
computation using signed arithmetic, we need to make sure that we
don't accidentally contaminate the computation with any extraneous
high bits. This means that we must be sure to cut constants to the
appropriate width, as well as computations, so do so; this fixes
bug #974406 from Paul Dietz. (In addition the change from cutting
to the requested width to the implementation width fixes #903821,
so Go Team!)
Test cases. Minimally horrible test case for #903821; far worse
suggestions were made on #sbcl IRC...

Hi,
The change below breaks OpenBSD/PPC. It does not break Darwin/PPC, nor OpenBSD/x86 or OpenBSD/AMD64.
I've attached a log from a broken build, and, a log from a patched build. I've also attached the
world's silliest patch since all it does is remove part of the commit below.
If someone has an idea I'm happy to debug and explore more. I don't understand the bit of code
that is broken and haven't gotten very far to understand why it is broken.
Thanks.
bruce
commit 52b1041d3a14eaa4e45f6d8edfbdc0dec4292239
Author: Christophe Rhodes <csr21@...>
Date: Thu Apr 5 19:55:05 2012 +0100
Fix bug in unsigned modular arithmetic using a signed implementation
If we aim to be clever by implementing an unsigned modular arithmetic
computation using signed arithmetic, we need to make sure that we
don't accidentally contaminate the computation with any extraneous
high bits. This means that we must be sure to cut constants to the
appropriate width, as well as computations, so do so; this fixes
bug #974406 from Paul Dietz. (In addition the change from cutting
to the requested width to the implementation width fixes #903821,
so Go Team!)
Test cases. Minimally horrible test case for #903821; far worse
suggestions were made on #sbcl IRC...

"Bruce O'Neel" <ecl@...> writes:
> Hi,
>
> The change below breaks OpenBSD/PPC. It does not break Darwin/PPC, nor OpenBSD/x86 or OpenBSD/AMD64.
>
> I've attached a log from a broken build. I've also attached the world's silliest patch since all it
> does is remove part of the commit below.
>
> If someone has an idea I'm happy to debug and explore more. I don't understand the bit of code
> that is broken and haven't gotten very far to understand why it is broken.
It's always possible that it's my patch that is broken, but it's also
possible that something else is broken somewhere, which has been exposed
by what I think this fixes, either directly or just because of random
movement of stuff. That said, could someone familiar with IR1 and
modular arithmetic please check the bisected commit?
52b1041d3a14eaa4e45f6d8edfbdc0dec4292239 is the decreed culprit.
The new bit in this commit is that, in code of the form
(logand (logfoo ...) <some bignum>)
where the compiler can /prove/ that the result of the logand is always
some kind of machine integer, the <some bignum> is truncated to the
appropriate machine integer width. (Since the higher bits were proven
to be irrelevant to the computation, they might as well be thrown away).
The commit might be wrong. There is probably a reason that Alexey
didn't do it way back when. But I am surprised by this, and the fact
that's it's broken (I presume deterministically) on just one platform
suggests to me that it's more likely that something else is the matter.
(If this were a problem with the patch, I'd expect this to show up more
or less deterministically on every platform, because it is
platform-agnostic).
The point at which it breaks is also the point at which, I believe, the
lisp side does its first call to open() -- it's successfully executed
the first few forms in make-target-2.lisp, as you can see from the
echoed output at the end of the build log, and then dies before doing
anything much at all of the (load "src/cold/warm.lisp") form. If the
ABI details were wrong, for example stack alignment, that could show up
as memory corruption at this point, for example if the return value from
open() isn't where the lisp expects it to be.
Any other ideas?
Cheers,
Christophe

Hi,
I think your update is just fine, it's just triggered something out of
date and/or broken in OpenBSD/PPC. It's not like OpenBSD/PPC is a
common platform.
Eric Marsden <eric.marsden@...> reported a bug back in December
about ldb not working on PPC.
On my amd64 system:
* (defun one (a)
(ldb (byte 9 27) a))
ONE
* (defun two (a)
(declare (type (integer -3 57216651) a))
(ldb (byte 9 27) a))
TWO
* (- (one 10) (two 10))
0
*
On my OpenBSD/PPC system:
* (defun one (a)
(ldb (byte 9 27) a))
ONE
* (defun two (a)
(declare (type (integer -3 57216651) a))
(ldb (byte 9 27) a))
TWO
* (- (one 10) (two 10))
-320
*
I'll explore this a bit. I'll also poke at different limits that might be slightly or bizzarly different one OpenBSD/PPC vs other
OpenBSD systems.
Thanks!
cheers
bruce
On Mon, Apr 16, 2012 at 09:21:40PM +0100, Christophe Rhodes wrote:
> "Bruce O'Neel" <ecl@...> writes:
>
> > Hi,
> >
> > The change below breaks OpenBSD/PPC. It does not break Darwin/PPC, nor OpenBSD/x86 or OpenBSD/AMD64.
> >
> > I've attached a log from a broken build. I've also attached the world's silliest patch since all it
> > does is remove part of the commit below.
> >
> > If someone has an idea I'm happy to debug and explore more. I don't understand the bit of code
> > that is broken and haven't gotten very far to understand why it is broken.
>
> It's always possible that it's my patch that is broken, but it's also
> possible that something else is broken somewhere, which has been exposed
> by what I think this fixes, either directly or just because of random
> movement of stuff. That said, could someone familiar with IR1 and
> modular arithmetic please check the bisected commit?
> 52b1041d3a14eaa4e45f6d8edfbdc0dec4292239 is the decreed culprit.
>
> The new bit in this commit is that, in code of the form
> (logand (logfoo ...) <some bignum>)
> where the compiler can /prove/ that the result of the logand is always
> some kind of machine integer, the <some bignum> is truncated to the
> appropriate machine integer width. (Since the higher bits were proven
> to be irrelevant to the computation, they might as well be thrown away).
>
> The commit might be wrong. There is probably a reason that Alexey
> didn't do it way back when. But I am surprised by this, and the fact
> that's it's broken (I presume deterministically) on just one platform
> suggests to me that it's more likely that something else is the matter.
> (If this were a problem with the patch, I'd expect this to show up more
> or less deterministically on every platform, because it is
> platform-agnostic).
>
> The point at which it breaks is also the point at which, I believe, the
> lisp side does its first call to open() -- it's successfully executed
> the first few forms in make-target-2.lisp, as you can see from the
> echoed output at the end of the build log, and then dies before doing
> anything much at all of the (load "src/cold/warm.lisp") form. If the
> ABI details were wrong, for example stack alignment, that could show up
> as memory corruption at this point, for example if the return value from
> open() isn't where the lisp expects it to be.
>
> Any other ideas?
>
> Cheers,
>
> Christophe

"Bruce O'Neel" <ecl@...> writes:
> On my OpenBSD/PPC system:
>
> * (defun one (a)
> (ldb (byte 9 27) a))
> ONE
> * (defun two (a) > (declare (type (integer -3 57216651) a)) > (ldb (byte 9 27) a))
>
> TWO
> * (- (one 10) (two 10))
>
> -320
> *
Ahaha. I'm slightly scared: either this is wrong on all PPC platforms,
or I don't understand, or, wait, maybe this is something to do with the
top half of what are really 64-bit registers being assumed to be zero
under some circumstances? And other platforms' ABIs making that happen
but not OpenBSD?
This is exactly the kind of thing that would be exposed more with my
patch: there would be more hardware arithmetic and fewer full calls to
"generic" (in the generic sense :-) Lisp functions.
> I'll explore this a bit. I'll also poke at different limits that
> might be slightly or bizzarly different one OpenBSD/PPC vs other
> OpenBSD systems.
Cheers,
Christophe

"Bruce O'Neel" <ecl@...> writes:
> I tried a small fix to the patch, ppc-ldb2-beo.diff. Sadly that just gets me back
> to the corrupted image problem as before.
Your fix looks right (and what Paul intended); I am surprised that you
get back to the corrupted image problem -- unless it's somehow
nondeterministic. I think what might be needed is a little bit of close
attention to this problem; things to do:
* maybe start with my patch, which is conservative, and verify that it
does in fact build;
* save away all the cross-compiled fasls
* change things to the corrected version of Paul's patch, and run the
build again
* run comparisons on the working and broken cross-compiled fasls. If
we're lucky, there should be few differences, and the differences
should be significant. (I did some work a while ago to make this
possible, by eliminating many kinds of trivial differences in
cross-compiled fasls; I don't know how much it will have bitrotted in
the intervening years).
Alternatively, the time-honoured "think very hard" process might have to
be used...
Cheers,
Christophe

Hi,
You patch works just fine. It both builds as well as fixes the LDB
problem on PPC.
I'll try the comparisons with the fasls. I probably can't do that until
early next week though.
Thanks!
cheers
bruce
On Fri, Apr 20, 2012 at 04:44:06PM +0100, Christophe Rhodes wrote:
> "Bruce O'Neel" <ecl@...> writes:
>
> > I tried a small fix to the patch, ppc-ldb2-beo.diff. Sadly that just gets me back
> > to the corrupted image problem as before.
>
> Your fix looks right (and what Paul intended); I am surprised that you
> get back to the corrupted image problem -- unless it's somehow
> nondeterministic. I think what might be needed is a little bit of close
> attention to this problem; things to do:
>
> * maybe start with my patch, which is conservative, and verify that it
> does in fact build;
> * save away all the cross-compiled fasls
> * change things to the corrected version of Paul's patch, and run the
> build again
> * run comparisons on the working and broken cross-compiled fasls. If
> we're lucky, there should be few differences, and the differences
> should be significant. (I did some work a while ago to make this
> possible, by eliminating many kinds of trivial differences in
> cross-compiled fasls; I don't know how much it will have bitrotted in
> the intervening years).
>
> Alternatively, the time-honoured "think very hard" process might have to
> be used...
>
> Cheers,
>
> Christophe

Hi,
Is there a reason why Paul's patch only patches
src/compiler/ppc/vm.lisp
and Christophe's patch patches both
src/compiler/ppc/vm.lisp
and
src/compiler/ppc/arith.lisp
??
If I combined Paul's patch with my fix of src/compiler/ppc/vm.lisp with Christophe's patch of src/compiler/ppc/arith.lisp
do you think that would be a good start?
Thanks.
bruce
On Fri, Apr 20, 2012 at 04:44:06PM +0100, Christophe Rhodes wrote:
> "Bruce O'Neel" <ecl@...> writes:
>
> > I tried a small fix to the patch, ppc-ldb2-beo.diff. Sadly that just gets me back
> > to the corrupted image problem as before.
>
> Your fix looks right (and what Paul intended); I am surprised that you
> get back to the corrupted image problem -- unless it's somehow
> nondeterministic. I think what might be needed is a little bit of close
> attention to this problem; things to do:
>
> * maybe start with my patch, which is conservative, and verify that it
> does in fact build;
> * save away all the cross-compiled fasls
> * change things to the corrected version of Paul's patch, and run the
> build again
> * run comparisons on the working and broken cross-compiled fasls. If
> we're lucky, there should be few differences, and the differences
> should be significant. (I did some work a while ago to make this
> possible, by eliminating many kinds of trivial differences in
> cross-compiled fasls; I don't know how much it will have bitrotted in
> the intervening years).
>
> Alternatively, the time-honoured "think very hard" process might have to
> be used...
>
> Cheers,
>
> Christophe

On 2012-04-20, at 5:30 PM, Bruce O'Neel wrote:
> Is there a reason why Paul's patch only patches
>
> src/compiler/ppc/vm.lisp
>
> and Christophe's patch patches both
>
> src/compiler/ppc/vm.lisp
>
> and
>
> src/compiler/ppc/arith.lisp
>
> ??
>
> If I combined Paul's patch with my fix of src/compiler/ppc/vm.lisp with Christophe's patch of src/compiler/ppc/arith.lisp
> do you think that would be a good start?
Oh yes. My patch was only a refactoring of Christophe's vm.lisp changes.
Paul Khuong

"Bruce O'Neel" <ecl@...> writes:
> VOP LDB-C/FIXNUM A!14[NL0] {9 27} => t20[A0]
> RLWINM #<TN t20[A0]>, #<TN A!14[NL0]>, 5, 21, 29
Does this (the specialized version of the function, TWO) actually work
on any PPC platform? I have only run through this on paper, but this
clever implementation of ldb for fixnums only works, I think, if the top
bit under consideration fit in the original fixnum -- whereas we do this
for all values of size and position, as long as a fixnum is being
returned. Because RLWINM does a rotate, I think it mixes bits in which
don't want to be mixed into the answer. (I don't see how it can
possibly get negative numbers right, either, but maybe I am suffering
from a failure of imagination, or a lack of caffeine.)
Cheers,
Christophe

Hi,
This patch not only fixes the LDB problem on both Darwin/PPC and OpenBSD/PPC,
but, it also fixes the build problem on OpenBSD/PPC.
I don't have easy access to a Linux/PPC system, but, I can test it on Monday
if you want.
Thanks very much!
cheers
bruce
On Tue, Apr 17, 2012 at 01:01:55PM +0100, Christophe Rhodes wrote:
> Christophe Rhodes <csr21@...> writes:
>
> > "Bruce O'Neel" <ecl@...> writes:
> >
> >> VOP LDB-C/FIXNUM A!14[NL0] {9 27} => t20[A0]
> >> RLWINM #<TN t20[A0]>, #<TN A!14[NL0]>, 5, 21, 29
> >
> > Does this (the specialized version of the function, TWO) actually work
> > on any PPC platform? I have only run through this on paper, but this
> > clever implementation of ldb for fixnums only works, I think, if the top
> > bit under consideration fit in the original fixnum -- whereas we do this
> > for all values of size and position, as long as a fixnum is being
> > returned. Because RLWINM does a rotate, I think it mixes bits in which
> > don't want to be mixed into the answer. (I don't see how it can
> > possibly get negative numbers right, either, but maybe I am suffering
> > from a failure of imagination, or a lack of caffeine.)
>
> So, if I'm right, there's the additional constraint for an RLWINM
> implementation of LDB that the size and position not exceed the original
> in-register representation of the integer. (Otherwise, the rotation
> will cause unwanted bits to appear). The attached patch is a completely
> untested attempt at that; some assembly might be needed to make it work
> properly.
>
>
> Cheers,
>
> Christophe

"Bruce O'Neel" <ecl@...> writes:
> This patch not only fixes the LDB problem on both Darwin/PPC and
> OpenBSD/PPC, but, it also fixes the build problem on OpenBSD/PPC.
Excellent.
> I don't have easy access to a Linux/PPC system, but, I can test it on
> Monday if you want.
I don't have access to any PPC systems at all. The patch needs at a
minimum a test case or several, to try to stop this coming in again. I
think it's also a little bit too conservative on signed-num and
unsigned-num primitive type arguments; shouldn't there be
signed=>unsigned and unsigned=>unsigned variants of %%LDB, with size and
position ranges (integer 1 32) and (integer 0 31) respectively,
retaining the constraint that size + position <= 32.
Also, Paul Khuong was threatening on IRC to submit a cleaned-up version
of my patch -- Paul, what's your plan?
Cheers,
Christophe

Christophe Rhodes <csr21@...> writes:
> I think it's also a little bit too conservative on signed-num and
> unsigned-num primitive type arguments; shouldn't there be
> signed=>unsigned and unsigned=>unsigned variants of %%LDB, with size
> and position ranges (integer 1 32) and (integer 0 31) respectively,
> retaining the constraint that size + position <= 32.
Speaking to myself, but: one problem with this idea might be that the
decision to convert to %%LDB (i.e. rlwinm) is made before any
representation selection pass, and that this would therefore constrain
representation selection. But I'm not sure if I can think of a case
where this constraint ends up being bad; the edge case is probably
something like
(defun foo (x)
(declare (type (unsigned-byte 32) x))
(list (1+ x) (ldb (byte 8 24) x)))
where the (1+ x) pretty much forces X to be a bignum. But on the other
hand, the low-level implementation of the LDB in terms of shifts and
masks will end up using an unsigned-num representation, so maybe having
an unsigned-num (and signed-num) version of %%LDB is no harm after all?
Cheers,
Christophe

On 2012-04-18, at 4:21 AM, Christophe Rhodes wrote:
> I don't have access to any PPC systems at all. The patch needs at a
> minimum a test case or several, to try to stop this coming in again. I
> think it's also a little bit too conservative on signed-num and
> unsigned-num primitive type arguments; shouldn't there be
> signed=>unsigned and unsigned=>unsigned variants of %%LDB, with size and
> position ranges (integer 1 32) and (integer 0 31) respectively,
> retaining the constraint that size + position <= 32.
Sounds right, although that kind of chance always make me afraid we're introducing more opportunities for bad representation selection.
> Also, Paul Khuong was threatening on IRC to submit a cleaned-up version
> of my patch -- Paul, what's your plan?
The attached patch should be the exact same, with less copy paste. Unfortunately, I don't have access to a PPC box either (:

Hi,
Thanks. Sadly the patch doesn't quite work. I attached the log file.
I tried a small fix to the patch, ppc-ldb2-beo.diff. Sadly that just gets me back
to the corrupted image problem as before.
Thanks.
cheers
bruce
On Wed, Apr 18, 2012 at 11:00:28AM -0400, Paul Khuong wrote:
> On 2012-04-18, at 4:21 AM, Christophe Rhodes wrote:
> > I don't have access to any PPC systems at all. The patch needs at a
> > minimum a test case or several, to try to stop this coming in again. I
> > think it's also a little bit too conservative on signed-num and
> > unsigned-num primitive type arguments; shouldn't there be
> > signed=>unsigned and unsigned=>unsigned variants of %%LDB, with size and
> > position ranges (integer 1 32) and (integer 0 31) respectively,
> > retaining the constraint that size + position <= 32.
>
> Sounds right, although that kind of chance always make me afraid we're introducing more opportunities for bad representation selection.
>
> > Also, Paul Khuong was threatening on IRC to submit a cleaned-up version
> > of my patch -- Paul, what's your plan?
>
>
> The attached patch should be the exact same, with less copy paste. Unfortunately, I don't have access to a PPC box either (:
>