In this post I show a few things I have found/collected in the last weeks
related to the performance of code compiled with DMD.
I have added two of them to the list of slow things (as tiny benchmarks) I keep:
http://www.fantascienza.net/leonardo/js/slow_d.zip
------------------------
This is D code that just performs many integer divisions (by 7, known at
compile time):
import std.stdio: printf;

int divideBySeven(int x) {
    return x / 7;
}

void main() {
    int i = int.max;
    int r;
    while (i--)
        r = divideBySeven(i);
    printf("%d\n", r);
}
The same code in C:
#include <limits.h>
#include <stdio.h>

int divideBySeven(int x) {
    return x / 7;
}

int main() {
    int i = INT_MAX;
    int r;
    while (i--)
        r = divideBySeven(i);
    printf("%d\n", r);
    return 0;
}
I have compiled the C code with MinGW (GCC 4.2.1) using -O3 -s, and the D
code with DMD 1.035 using -O -release -inline; the CPU is a 2 GHz Core2.
Timing results:
intdiv:
C: 5.04 s
D: 39.32 s
The ASM generated by DMD for the function divideBySeven():
mov ECX,7
cdq
idiv ECX
The ASM generated by GCC for the function divideBySeven():
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %ecx
pushl %ebx
movl $-1840700269, %ebx
movl %ebx, %eax
popl %ebx
imull %ecx
popl %ebp
leal (%edx,%ecx), %eax
sarl $2, %eax
sarl $31, %ecx
subl %ecx, %eax
--------------------------
This is a scanning benchmark, in C:
#include <stdio.h>

int main() {
    int counter0 = 0, counter1 = 0, counter2 = 0, counter3 = 0;
    int i = 300000000;
    while (i--) { // 0.63 s
        if (i % 4 == 0) {
            counter0++;
        } else if (i % 4 == 1) {
            counter1++;
        } else if (i % 4 == 2) {
            counter2++;
        } else {
            counter3++;
        }
    }
    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
    return 0;
}
The same code in D:
import std.stdio: printf;

int main() {
    int counter0, counter1, counter2, counter3;
    int i = 300_000_000;
    while (i--) { // 1.23 s
        if (i % 4 == 0) {
            counter0++;
        } else if (i % 4 == 1) {
            counter1++;
        } else if (i % 4 == 2) {
            counter2++;
        } else {
            counter3++;
        }
    }
    printf("%d %d %d %d\n", counter0, counter1, counter2, counter3);
    return 0;
}
Timings:
Scan (i = 300_000_000):
C: 0.63 s
D: 1.23 s
I can offer Asm code on request.
------------------------
The D1 docs strongly suggest using foreach whenever possible, avoiding the
less handy for(), so for almost a year I have assumed foreach is as fast as
for(), but these two versions show a difference:
20.66 seconds:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=1
My classless version, 25.91 seconds:
http://shootout.alioth.debian.org/gp4/benchmark.php?test=nbody&lang=dlang&id=2
That second version uses the following code in the most critical spot of the
program:
void advance(TyPlanets bodies, double dt) {
    foreach (idx, ref b; bodies) {
        double bm = b.mass;
        foreach (ref b2; bodies[idx + 1 .. length]) {
Replacing the foreach loops with for loops like the following gives a
significant performance increase; the code runs in about 19.5 seconds
(estimated timing on the Shootout computer; I can give you exact timings on my
PC if you want):
static void advance(TyPlanets bodies, double dt) {
    for (int idx; idx < bodies.length; idx++) {
        auto b = &(bodies[idx]);
        double bm = b.mass;
        for (int j = idx + 1; j < bodies.length; j++) {
            auto b2 = &(bodies[j]);
(Note that I haven't just replaced foreach with for: I have also removed the
slicing bodies[idx+1..$], which slows down the code a little by itself, but I
have seen that even with just the slice removed the foreach code is still
slower.)
------------------------
Looking at the assembly produced by the Intel compiler from C code that
contains many floating point operations, I can see how short and slick it is.
I have also read a few articles that show various ways to manage the floating
point stack as efficiently as possible. Some of those methods are slow or
quite slow at compile time, but they produce more efficient code.
Using those refined methods for all the FP operations of a large program may
lead to overly long compilation times. So profile-guided optimization may be
used to find the hot parts of the code, optimize their FP operations
especially well, and compile all the other FP parts with a faster compilation
method.
GCC also has the "hot" function attribute to let programmers manually mark
which functions are most critical:
http://gcc.gnu.org/onlinedocs/gcc-4.3.0//gcc/Function-Attributes.html
In D, in theory, something like this could be used:
optimize { ... }
(Like static if { ... }, it doesn't create a new scope.)
It tells the compiler that a piece of code that performs many FP operations is
speed critical, so the compiler can compile it with a slower (but more
effective) floating point stack allocation algorithm, like the ones I have
read about.
Bye,
bearophile

In this post I show a few things I have found/collected in the last weeks
related to the performance of code compiled with DMD.

[..]

The D1 docs strongly suggest using foreach whenever possible,
avoiding the less handy for(), so for almost a year I have
assumed foreach is as fast as for(), but these two versions show
a difference:

I have noticed the same problem with foreach.
A for loop is often faster. :/
I haven't tried ldc yet..

I have noticed the same problem with foreach.
A for loop is often faster. :/

You generally notice a difference only in the innermost and most critical
loops, so I think it's okay to use foreach() in 80-90% of the code of a
program.

I haven't tried ldc yet..

I haven't succeeded in compiling code with ldc yet. But I have tried the
latest LLVM with the C code of the nbody benchmark of the Shootout, and GCC
(MinGW) 4.2.1 produces code about twice as fast on my Core2. So the LLVM
result is even somewhat worse than the D code compiled with DMD.
Bye,
bearophile

I have noticed the same problem with foreach.
A for loop is often faster. :/

You generally notice a difference only in the innermost and most critical
loops, so I think it's okay to use foreach() in 80-90% of the code of a
program.

I haven't tried ldc yet..

I haven't succeeded in compiling code with ldc yet. But I have tried the
latest LLVM with the C code of the nbody benchmark of the Shootout, and GCC
(MinGW) 4.2.1 produces code about twice as fast on my Core2. So the LLVM
result is even somewhat worse than the D code compiled with DMD.

That's a bummer to hear. But the good point of LLVM is that it's
on an upward trend. With DMD only Walter has the ability to work on
backend codegen improvements, and we all know how much time he has
available for such things. Eventually I believe LLVM will leave DMD
in the dust, and probably GCC too.
--bb

In this post I show a few things I have found/collected in the last weeks
related to the performance of code compiled with DMD.

[..]

The D1 docs strongly suggest using foreach whenever possible,
avoiding the less handy for(), so for almost a year I have
assumed foreach is as fast as for(), but these two versions show
a difference:

I have noticed the same problem with foreach.
A for loop is often faster. :/

One thing to be clear about when talking about the speed of foreach is
whether you're using it on a built-in array or on a user type with an
opApply. The former is expected to be as fast as for(); the latter is
definitely not, because it has to call a delegate each time around the
loop.
--bb

In this post I show a few things I have found/collected in the last weeks
related to the performance of code compiled with DMD.

[..]

The D1 docs strongly suggest using foreach whenever possible,
avoiding the less handy for(), so for almost a year I have
assumed foreach is as fast as for(), but these two versions show
a difference:

I have noticed the same problem with foreach. A for loop is often
faster. :/

One thing to be clear about when talking about the speed of foreach is
whether you're using it on a built-in array or on a user type with an
opApply. The former is expected to be as fast as for(); the latter is
definitely not, because it has to call a delegate each time around the
loop.

Nobody would use "for"-loops with opApply in real world code. :P

Er, right, which is why using a for() loop is faster than foreach() in
that case. Sorry if I wasn't clear about that.
--bb