Not sure if this is the right place to ask, but as part of looking at mir.ndslice, I set out to port a simple lattice Boltzmann fluid dynamics simulation for learning purposes, starting with a collision kernel: https://gist.github.com/dextorious/d987865a7da147645ae34cc17a87729d

It is currently a literal, non-idiomatic port of this C++ example: https://gist.github.com/dextorious/9a65a20e353542d6fb3a8d45c515bc18

Ignoring the non-idiomatic loop syntax and similar details, the D version is over 40x slower (LDC v1.2.0, release build with -O3 and bounds checks disabled, compared against clang v4.0.0 -O3 on a Haswell CPU), which means I'm doing something horribly wrong. Having gone through the docs (and part of the vision library) and checked that the results are correct, I'm somewhat at a loss. Does anyone see a glaring error that would lead to this level of performance degradation?

@9il It would help a lot if you could extract a minimal example that shows things not being vectorized/optimized well. There is so much going on in the current example that it's hard to analyze why it doesn't optimize. Part of the problem could be that slices are used, which don't optimize so well yet (that's a work in progress).

Morning. I posted that code right as I went to sleep; now I've finally had a chance to look at the assembly. As you correctly said, the opIndex calls don't get inlined, which is expensive by itself and also blocks all further optimization via alias analysis and, ultimately, vectorization.

Moving over to a[i][j]-style indexing improved the performance by about 2x, but it's still ~720 ms vs. 30 ms.
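For reference, here's a minimal sketch (not the actual kernel) of the two indexing styles I compared, assuming mir-algorithm's slice allocator from the mir.ndslice package module; the timings above are from my machine, so treat the numbers as rough:

```d
// Minimal sketch contrasting the two indexing styles on a 2D mir slice.
// This is illustrative only, not code from the gist.
import mir.ndslice;

void main()
{
    auto a = slice!double(100, 100);   // allocate a 100x100 matrix
    a[] = 1.0;

    double sum1 = 0, sum2 = 0;
    foreach (i; 0 .. a.length!0)
        foreach (j; 0 .. a.length!1)
        {
            sum1 += a[i, j];   // multi-dimensional opIndex; didn't inline for me
            sum2 += a[i][j];   // row slice first, then a scalar index
        }
    assert(sum1 == sum2);
}
```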

Okay, there was an aliasing issue similar to what I recently encountered in Julia, which I fixed by explicitly operating on temporary variables ux0, etc., and only storing the results in the matrices at the end. This enabled some vectorization and brought the timing down to 55 ms. It still doesn't unroll the loop as extensively as clang does and the vectorization isn't quite complete, but we're now within 2x.
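To make the aliasing point concrete, here's a rough, hypothetical sketch of the restructuring; the names rho/ux/uy mirror the gist, but the arithmetic is a placeholder and the slice type is left generic rather than tied to a particular mir-algorithm version:

```d
// Hypothetical sketch of the scalar-temporary restructuring (not the actual
// collision kernel). All loads go into locals up front and the results are
// written back only once at the end of the iteration, so the optimizer no
// longer has to assume that every store might alias a later load.
void collideSketch(S)(S rho, S ux, S uy)
{
    foreach (i; 0 .. rho.length!0)
        foreach (j; 0 .. rho.length!1)
        {
            // load once into scalar temporaries
            immutable rho0 = rho[i][j];
            immutable ux0  = ux[i][j];
            immutable uy0  = uy[i][j];

            // placeholder arithmetic standing in for the collision step
            immutable uxNew = ux0 / rho0;
            immutable uyNew = uy0 / rho0;

            // single store per field at the end of the iteration
            ux[i][j] = uxNew;
            uy[i][j] = uyNew;
        }
}
```

The real kernel does far more arithmetic per cell, but the load-compute-store structure is the same change I made.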

If I understand this correctly, what I did was a more extreme version of what you just suggested about hoisting out the rows?
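This is roughly what I took "hoisting out the rows" to mean (a hypothetical sketch, not code from the kernel):

```d
// Hypothetical sketch of hoisting the row slice out of the inner loop so the
// hot loop only indexes a 1D slice. Not taken from the actual kernel.
void addOneSketch(S)(S a)
{
    foreach (i; 0 .. a.length!0)
    {
        auto row = a[i];              // take the row slice once per outer iteration
        foreach (j; 0 .. a.length!1)
            row[j] += 1.0;            // scalar indexing into the 1D row slice
    }
}
```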

Anyway, I'll clean up a few of the uglier details (C++-style for loops -> foreach, etc.) and post my current version as an issue on the repository.

Posted: libmir/mir-algorithm#42

Curiously, I ran into the same aliasing issue when I wrote a Julia version of this code, but there, manually introducing the scalar temporaries was enough to persuade the compiler to fully vectorize the loops and get within ~20% of the C++ benchmark.

So I suspect there may be room for improvement in terms of what information LDC exposes to the LLVM optimizer.