might be worth your while

10 replies

Mon, 2011-12-12, 23:56

extempore

Joined: 2008-12-17

On Sun, Sep 19, 2010 at 12:33 PM, martin odersky wrote:
> I have dreamed for about 5 years now that for loops should be just
> as fast as while loops, just using normal optimizations that are
> applicable everywhere. That's why I was always against special-casing
> for loops over ranges which would have been easy to do.

It took me another year and change, but maybe dreams do come true,
in a limited sort of way.

What's the catch? Actually I was hoping you guys would work that out and
let me know. Suboptimalities I have observed so far include some things
not being inlined which were inlined before, and that the performance
parity does not make it all the way into nested foreaches. I think I'm
still out in front though.

/** @note Making foreach run as fast as a while loop is a challenge.
* The key elements which I can observe making a difference are:
*
* - the inner loop should be as small as possible
* - the inner loop should be monomorphic
* - the inner loop should perform no boxing and no avoidable tests
*
* This is achieved by:
*
* - keeping initialization logic out of the inner loop
* - dispatching to custom variations based on initial conditions
* - tricking the compiler into always calling Function1#apply$mcVI$sp
*
* The last one is important and less than obvious. Even when foreach
* was specialized on Unit, only Int => Unit arguments benefited from it.
* Other function types would be accepted, but in the absence of full
* specialization the integer argument was boxed on every call. For example:
*
class A {
final def f(x: Int): Int = x + 1
// Calls Range.foreach, which calls Function1.apply
def g1 = 1 until 100 foreach { x => f(x) }
// Calls Range.foreach$mVc$sp, which calls Function1.apply$mcVI$sp
def g2 = 1 until 100 foreach { x => f(x) ; () }
}
*
* However! Since the result of the closure is always discarded, we
* simply cast it to Int => Unit, thereby executing the fast version.
* The seemingly looming ClassCastException can never arrive.
*/
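A minimal sketch of the cast trick the comment describes (the object and method names here are illustrative, not the actual Range.foreach source): since the closure's result is always discarded, the generic function can be viewed as Int => Unit, so every call goes through the Unit-specialized apply.

```scala
// Hypothetical sketch of the cast trick; RangeForeachSketch and fastForeach
// are made-up names for illustration, not the real library code.
object RangeForeachSketch {
  def fastForeach[U](start: Int, end: Int)(f: Int => U): Unit = {
    // The result of f is never used, so viewing it as Int => Unit is safe:
    // after erasure the cast is a no-op at runtime, and calls through the
    // Int => Unit static type compile to Function1#apply$mcVI$sp, which
    // takes an unboxed Int argument.
    val g = f.asInstanceOf[Int => Unit]
    var i = start
    while (i < end) { // the inner loop: small, monomorphic, no boxing
      g(i)
      i += 1
    }
  }
}

// Works even though the closure returns Int, not Unit: no ClassCastException,
// because the discarded result is never inspected.
var sum = 0
RangeForeachSketch.fastForeach(1, 100) { x => sum += x; x + 1 }
println(sum) // prints 4950, the sum of 1..99
```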

It can't find the lifted methods to inline. I would believe it is a
bug in the inliner, which does (and doesn't do) a lot of things I
don't entirely understand. (Some of those things I will eventually
understand, but some are bugs I haven't fixed yet.)

Damn, it's just one of those things I wish I had the time to get into and fix :(

You don't state it outright, but my sixth sense leads me to one inescapable conclusion: Awesome!

Does this depend on compiling with -optimize? Btw. I remember you
working on special casing Range.foreach some months ago with very
promising results. If I remember correctly Martin was against it, but
I think it had the advantage of not depending on -optimize and working
with nested foreaches, too. Both being very nice properties IMHO.

I wonder if Martin wouldn't change his mind now...

Regards,
Rüdiger

2011/12/12 Paul Phillips :
> On Sun, Sep 19, 2010 at 12:33 PM, martin odersky wrote:
>> I have dreamed for about 5 years now that for loops should be just
>> as fast as while loops, just using normal optimizations that are
>> applicable everywhere. That's why I was always against special-casing
>> for loops over ranges which would have been easy to do.
>
> It took me another year and change, but maybe dreams do come true,
> in a limited sort of way.
>
> [info] length benchmark ns linear runtime
> [info] 10 Foreach 10.35 =
> [info] 10 TFor 5.23 =
> [info] 10 While 4.94 =
> [info] 100 Foreach 39.33 =
> [info] 100 TFor 34.39 =
> [info] 100 While 29.44 =
> [info] 1000 Foreach 343.11 ===
> [info] 1000 TFor 329.17 ==
> [info] 1000 While 313.10 ==
> [info] 10000 Foreach 3333.05 ==============================
> [info] 10000 TFor 3281.15 =============================
> [info] 10000 While 3096.62 ===========================
>
> What's the catch? Actually I was hoping you guys would work that out and
> let me know. Suboptimalities I have observed so far include some things
> not being inlined which were inlined before, and that the performance
> parity does not make it all the way into nested foreaches. I think I'm
> still out in front though.
>
> https://github.com/scala/scala/commit/4cfc633fc6
>
> /** @note Making foreach run as fast as a while loop is a challenge.
> * The key elements which I can observe making a difference are:
> *
> * - the inner loop should be as small as possible
> * - the inner loop should be monomorphic
> * - the inner loop should perform no boxing and no avoidable tests
> *
> * This is achieved by:
> *
> * - keeping initialization logic out of the inner loop
> * - dispatching to custom variations based on initial conditions
> * - tricking the compiler into always calling Function1#apply$mcVI$sp
> *
> * The last one is important and less than obvious. Even when foreach
> * was specialized on Unit, only Int => Unit arguments benefited from it.
> * Other function types would be accepted, but in the absence of full
> * specialization the integer argument was boxed on every call. For example:
> *
> class A {
> final def f(x: Int): Int = x + 1
> // Calls Range.foreach, which calls Function1.apply
> def g1 = 1 until 100 foreach { x => f(x) }
> // Calls Range.foreach$mVc$sp, which calls Function1.apply$mcVI$sp
> def g2 = 1 until 100 foreach { x => f(x) ; () }
> }
> *
> * However! Since the result of the closure is always discarded, we
> * simply cast it to Int => Unit, thereby executing the fast version.
> * The seemingly looming ClassCastException can never arrive.
> */

On Tue, Dec 13, 2011 at 12:36 AM, Ruediger Keller wrote:
> Does this depend on compiling with -optimize? Btw. I remember you
> working on special casing Range.foreach some months ago with very
> promising results. If I remember correctly Martin was against it, but
> I think it had the advantage of not depending on -optimize and working
> with nested foreaches, too. Both being very nice properties IMHO.
>
> I wonder if Martin wouldn't change his mind now...

Yes, it requires -optimise. If macros arrive on schedule this will be
a bit of a hollow victory since they'll offer what you're looking for.
Still, even a hollow victory is better than defeat.

On Tue, Dec 13, 2011 at 2:56 AM, Miguel Garcia wrote:
> we get both less baggage and the inlinings (all of them) we were looking
> we get both less baggage and the inlinings (all of them) we were looking
> for, in fewer than 140 instructions (including zero NEWs for anon-closure
> classes):

That sounds pretty promising. Can you or anyone reassure me that I
can introduce code which relies on overflow without losing any sleep?
From what I found in the jvm spec it sounds like we're good, but I'd
like to hear at least one other person say that so I have someone to
blame later. (The suggested implementation relies on Int.MinValue - N
+ N landing on Int.MinValue.)
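For what it's worth, this is exactly what the JVM spec promises: int arithmetic is two's-complement and wraps silently on overflow (iadd/isub never throw), so (Int.MinValue - N) + N lands back on Int.MinValue for every N. A quick sanity check:

```scala
// Two's-complement wraparound: (a - n) + n == a (mod 2^32) for all ints,
// so the round trip through Int.MinValue is exact -- no exception, no loss.
val samples = List(0, 1, 12345, Int.MaxValue, Int.MinValue)
val allLand = samples.forall(n => Int.MinValue - n + n == Int.MinValue)
println(allLand) // prints true
```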

Oh, and I glossed over the fact that there are two distinct proposed
implementations going in that ticket. Yours doesn't rely on overflow,
but if we can rely on the overflow, then Daniel's looks simpler.

On Tue, Dec 13, 2011 at 09:02, Paul Phillips wrote:
> On Tue, Dec 13, 2011 at 2:56 AM, Miguel Garcia wrote:
>> we get both less baggage and the inlinings (all of them) we were looking
>> for, in fewer than 140 instructions (including zero NEWs for anon-closure
>> classes):
>
> That sounds pretty promising. Can you or anyone reassure me that I
> can introduce code which relies on overflow without losing any sleep?
> From what I found in the jvm spec it sounds like we're good, but I'd
> like to hear at least one other person say that so I have someone to
> blame later. (The suggested implementation relies on Int.MinValue - N
> + N landing on Int.MinValue.)