Jun 20 2017

Addressed @wmi's concern by limiting the targets to recurrence cycles in which only the last instruction of the recurrence (the one that feeds the PHI instruction) can have uses outside of the recurrence. This is not an ideal solution yet, and a more fundamental solution (such as having recurrence optimization as a separate pass and/or using live-range analysis for it) should follow, but I still think it is worth having here.
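
For illustration, a hand-written C++ sketch (my own example, not code from the patch) of the restricted shape: the intermediate value only has uses inside the cycle, and only the final value, which feeds the PHI, escapes the loop.

```cpp
// A hand-written sketch (not from the patch) of the restricted recurrence
// shape: `t` is used only inside the recurrence cycle, and only the final
// `acc`, which feeds the loop PHI, has a use outside of it.
int fold(const int *a, int n) {
  int acc = 0;               // becomes the PHI of the recurrence
  for (int i = 0; i < n; ++i) {
    int t = acc + a[i];      // intermediate: no uses outside the cycle
    acc = t * 3;             // last instruction of the cycle; feeds the PHI
  }
  return acc;                // the only use outside the recurrence
}
```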

I know that we're currently missing opportunities for large vectorizable loops with low (static) trip counts. Smaller inner loops are also good candidates for unpredicated vectorization, but we may need to be a bit careful because of modeling inaccuracies and phase-ordering effects (e.g. if we don't vectorize a loop, then we'll end up unrolling it when the unroller runs).
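
A hand-written sketch of the kind of candidate I mean (my own example, for illustration only):

```cpp
// A minimal sketch of the loops under discussion: a vectorizable body with
// a low static trip count. If the vectorizer declines it, the unroller
// will typically fully unroll it later -- the phase-ordering effect
// mentioned above.
void saxpy8(float *x, const float *y, float a) {
  for (int i = 0; i < 8; ++i)   // low static trip count
    x[i] = a * x[i] + y[i];     // vectorizable without predication
}
```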

Got it. My concern was about small single-level loops with low trip counts, as I observe them pretty frequently. I have no objection to accepting this patch and improving the cost estimator separately.

Do you mean that you see such loops frequently with dynamically-small trip counts, or with static trip counts? I assume the small loops with (small) static trip counts will generally be unrolled.

Jun 19 2017

I think this is the right approach, but I'm concerned that the experimental results I shared on D32451 show it is generally better not to vectorize low-trip-count loops. @Ayal, I wonder if you have any results showing that this patch actually improves performance. Thanks!

Jun 18 2017

@tejohnson Sorry for the late reply; I was without internet access for days. I have no objection to setting the default value to true, but I'd prefer to do that in a separate patch so that we can track the performance impact of each change more easily.

May 22 2017

Comparing the existing implementation and this patch, I don't observe a noticeable compile-time difference with Video-SIMDBench. I compiled the benchmark suite three times each; the median was 27.12 sec vs. 27.55 sec, and the average was 27.96 sec vs. 27.89 sec. There were some code-size differences, but that is simply because more vectorization produces bigger code. There is no difference between the OptForSize and non-vectorized cases.

@wmi Thank you for your reply. I agree with you that we should consider tied operand groups, and, as @qcolombet mentioned in the previous comment, if tied-operand information is already available before this pass, I'd like to discuss where the best place to implement this would be.

May 17 2017

@qcolombet Actually, you're right: given tied operands, this can be done while still in SSA form. Do you have a suggestion for a better place to implement it? I thought what isRevCopyChain does is the closest to what this patch does.

May 11 2017

I don't have a strong opinion about whether static trip counts and profile-based trip counts should be treated differently. I treated them separately because the existing implementation sets OptForSize to true instead of returning false when it makes a profile-based decision. It would be great if someone could explain the reasoning behind that.
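
To make the asymmetry concrete, here is a hedged sketch (a hypothetical helper of my own, not the actual LoopVectorize code) of the behavior described above:

```cpp
// Hypothetical sketch of the asymmetry: a statically known low trip count
// rejects vectorization outright, while a profile-derived low trip count
// merely switches the vectorizer into OptForSize mode.
enum class Decision { Vectorize, VectorizeOptForSize, DontVectorize };

Decision classifyByTripCount(unsigned StaticTC, unsigned ProfileTC,
                             unsigned Threshold) {
  if (StaticTC != 0 && StaticTC < Threshold)
    return Decision::DontVectorize;        // static case: "return false"
  if (ProfileTC != 0 && ProfileTC < Threshold)
    return Decision::VectorizeOptForSize;  // profile case: OptForSize = true
  return Decision::Vectorize;
}
```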

Apr 24 2017

@mkuper I can do that, but the issue is that the vectorized part doesn't dominate the execution time of the benchmark (the profile is pretty flat across multiple functions), so it may not be suitable for measuring the impact of a vectorization change. Let me think about a good way to measure it. Thanks!

Apr 21 2017

@MatzeB I was considering other passes as well for the sake of simplicity, but I concluded that the two-address instruction pass is the right place to do this. The problem doesn't appear while we're in SSA form, but becomes an issue when we decide which operand register should be the destination register. I think this is why the isProfitableToCommute function is in the two-address instruction pass.
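
A hypothetical sketch of that decision (my own simplified data structures, not LLVM's actual API): a two-address instruction has the form `dst = src0 op src1` with dst tied to src0, so lowering out of SSA clobbers src0; for a commutable op we can choose which source gets clobbered, and the profitable choice depends on liveness information that only matters once we leave SSA form.

```cpp
#include <utility>

// Hypothetical, simplified representation of a two-address instruction:
// Dst is tied to Src0, i.e. lowering produces dst = src0; dst op= src1.
struct TwoAddrInst {
  unsigned Dst, Src0, Src1;
  bool IsCommutable;
};

void chooseTiedOperand(TwoAddrInst &MI, bool Src0LiveAfter,
                       bool Src1LiveAfter) {
  // Clobbering a source that is still live afterwards forces an extra copy
  // to preserve it; prefer to tie the destination to a source that dies
  // at this instruction.
  if (MI.IsCommutable && Src0LiveAfter && !Src1LiveAfter)
    std::swap(MI.Src0, MI.Src1);
}
```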

Apr 12 2017

Given these numbers, it doesn't seem worth turning this on by default. @dberlin, just curious: do you have a rough estimate of how much compilation time could be saved by implementing rank-order-based iteration optimization?

Apr 11 2017

@dberlin Thank you for your comments. I agree with you that it is not easy to invent a good heuristic around this knob, but I still find it useful for some cases that actually matter. Assuming this doesn't make the implementation much more complex, wouldn't it be better for compiler users to have more options?

No perf numbers with public benchmarks yet, but with our internal benchmark I confirmed that aggressive load PRE removes redundant instructions in a hot loop. I'll report SPEC2006 numbers later. This patch is meant to quickly unlock the feature before making more fundamental improvements.
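
For reference, a classic hand-written shape of the partial redundancy that load PRE removes (an illustrative example, not code from our benchmark):

```cpp
// *p is loaded on the taken path through the branch and again
// unconditionally afterwards. Load PRE inserts a load on the other path,
// making the second load fully redundant so GVN can delete it.
int partiallyRedundant(int *p, bool c) {
  int x = 0;
  if (c)
    x = *p;       // *p available on this path only
  return x + *p;  // partially redundant load, removable by load PRE
}
```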

Mar 3 2017

@inglorion That makes a lot of sense. Maybe we don't even need -g, because -S -emit-llvm shows source_filename. I'll run some more experiments with relpath/abspath and debug locations, and submit a revised patch. Thanks for the comment!