Hi Guys,
I have just begun work on the x86 JIT backend. Right now I am at a
stage where further design decisions need to be made, and those
decisions need to be informed by how a _fast_, JIT-compatible
x86 codegen is structured.
Since I believe this is an interesting topic, I will give you an
over-the-shoulder perspective on it.
At the time of posting, the video is still uploading, but you
should be able to see it soon.
https://www.youtube.com/watch?v=pKorjPAvhQY
Cheers,
Stefan

It's helpful for newCTFE's development. :)
I estimate the JIT will easily be 10 times faster than my
bytecode interpreter, which would make it about 100-1000x faster
than the current CTFE.

Does this apply to templates too? I recently tried some code, and
the templated version, with about 10 instantiations for 4-5
types, increased compile time from about 1 sec up to 4! The
template itself was straightforward; it just had a bunch of
static if-else chains to special-case types.
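Roughly this kind of shape (a made-up example for illustration,
not the actual code):

// Hypothetical sketch of such a template: one function,
// instantiated for a handful of types, whose body is a chain of
// static ifs that special-case each type.
string classify(T)(T value)
{
    static if (is(T == int) || is(T == long))
        return "integral";
    else static if (is(T == float) || is(T == double))
        return "floating point";
    else static if (is(T : const(char)[]))
        return "string-like";
    else
        return "something else";
}

unittest
{
    assert(classify(1) == "integral");
    assert(classify(1.0) == "floating point");
    assert(classify("hi") == "string-like");
}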

No, it most likely will not.
However, I am planning to work on speeding up templates after
newCTFE is done.

If you could share the code, it would be appreciated.
If you cannot share it publicly, come by IRC sometime.
I am Uplink|DMD there.

Sorry, my mistake: that was actually caused by the build system
and added dependencies (which are compiled every time no matter
what, hence the slowdown). Testing overloaded functions vs. the
template shows no significant difference in build times.

Is there not some way that you could get the current
interpreter-based implementation into dmd sooner, and then
modify the design later if necessary when you do the x86 JIT?
The benefits of having just *fast* CTFE sooner are perhaps larger
than the benefits of having *even faster* CTFE later. Faster
templates are also something that might be higher priority -
assuming it will be you who does the work there.
Obviously it's your time and you're free to do whatever you like
whenever you like; I was just wondering what your reasoning for
the ordering of your plan is.

newCTFE is currently at a phase where high-level features have
to be implemented.
For that reason I am looking to extend the interface to support,
for example, scaled loads and the like.
Otherwise you end up with 1000 temporaries that just add offsets
to pointers (sketched below).
Also, and perhaps more importantly, I am sick and tired of
hearing "why don't you use ldc/llvm?" all the time...

x86 has addressing modes which allow you to multiply an index by
a small set of scale factors (1, 2, 4, or 8) and add it as an
offset to the pointer you want to load from.
This makes memory access patterns more transparent to the
caching and prefetch systems, and it reduces the overall
code size.
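A small example (my own; the asm in the comments shows typical
codegen, not a guarantee):

// One indexed load. With base+index*scale addressing the whole
// access is a single instruction; without it, the scaling and
// the addition each need their own instruction.
int pick(int* arr, long i)
{
    return arr[i];
    // with the addressing mode (System V ABI: arr in RDI, i in RSI):
    //     mov EAX, [RDI + RSI*4]   ; one instruction
    // without it:
    //     shl RSI, 2               ; i * 4
    //     add RDI, RSI             ; arr + byte offset
    //     mov EAX, [RDI]           ; load
}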

Oh, ok. AFAIK the decoding of indexing modes into micro-ops (the
real instructions used inside the CPU, not the actual op-codes)
has no effect on the caching system. It may, however, compress
the generated code, so you don't flush the instruction cache,
and it speeds up the decoding of op-codes into micro-ops.
If you want to improve cache loads, you have to consider when to
use the "prefetch" instructions, but the effect (positive or
negative) varies greatly between CPU generations, so you would
basically need to target each CPU generation individually.
That is probably too much work to be worthwhile: it usually
doesn't pay off until you work on large datasets, and then you
usually have to be careful about partitioning the data into
cache-friendly working sets. Probably not so easy to do in a JIT.
You'll probably get a decent performance boost without worrying
about caching too much in the first implementation anyway. Any
gains in that area could be obliterated in the next CPU
generation... :-/

That's already the case. Intel and AMD (especially with Ryzen)
have strongly discouraged the use of prefetch instructions since
at least the Core 2 and Athlon 64. The icache cost rarely pays
off, and explicit prefetching very often breaks the automatic
prefetcher's algorithms by wasting memory bandwidth.

I think it just has to be decided on a case-by-case basis. But
if one doesn't target a specific set of CPUs and a specific,
predictable access pattern (like visiting every 4th cache line),
then one probably shouldn't do it.
There are also many different variants to choose from:
prefetch-for-write, prefetch-for-one-time-use,
prefetch-to-cache-level-2, etc. (mapped out below). That is hard
to get right in a small-scale JIT without knowledge of the
algorithm or the dataset.
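Roughly, those variants map onto the x86 prefetch family as
below; in D they are reachable through core.simd's prefetch
template (the per-locality instruction choice is my reading of
the docs, so verify against your druntime):

// core.simd.prefetch!(writeFetch, locality)(address):
// locality runs from 3 (keep in all cache levels) down to 0
// (one-time use); writeFetch = true prefetches for writing.
import core.simd : prefetch;

void warm(const(void)* p)
{
    prefetch!(false, 3)(p); // ~ prefetcht0: into every cache level
    prefetch!(false, 2)(p); // ~ prefetcht1: into L2 and up
    prefetch!(false, 1)(p); // ~ prefetcht2: into L3 and up
    prefetch!(false, 0)(p); // ~ prefetchnta: one-time use, minimal pollution
    prefetch!(true,  3)(p); // ~ prefetchw: prefetch for an upcoming write
}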

Have you considered using the LLVM JIT compiler for CTFE? We
already have an LLVM front end. This would mean that CTFE would
depend on LLVM, which is a large dependency, but it would create
very fast, optimized code for CTFE on any platform.
Keep in mind that I'm not as familiar with the technical details
of CTFE, so you may see a lot of negative ramifications that I'm
not aware of. I just want to make sure it's being considered, and
to hear what your and others' thoughts are.

I can't help but laugh at this after the above posts...

I totally missed when Stefan said:

Also, and perhaps more importantly, I am sick and tired of
hearing "why don't you use ldc/llvm?" all the time...

That is pretty hilarious :) I suppose I just demonstrated the
reason he is attempting to create an x86 jitter: so he will have
an interface that could be extended to something like LLVM. Wow.