Details

Introduce three pseudo instructions to be used during DAG ISel to represent v8.1-m low-overhead loops. One maps to set_loop_iterations while loop_decrement_reg is lowered to two, so that we can separate the decrement and branching operations. The pseudo instructions are expanded pre-emission where we can still decide whether we actually want to generate a low-overhead loop. The pass currently bails, revering to an sub, icmp and br, in the cases where a call or stack spill/restore happens between the decrement and branching instructions.

yes, it is massive! :-) But I think we can simplify this a lot by using intrinsic @llvm.arm.space:

// A space-consuming intrinsic primarily for testing ARMConstantIslands. The
// first argument is the number of bytes this "instruction" takes up, the second
// and return value are essentially chains, used to force ordering during ISel.
def int_arm_space : Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty], []>;

as mentioned constant island tests are using this, I think you want to do something similar here.

How confident are you that this is always valid? From what I understand it creates an assumption that the code will not grow past this point in the pipeline (technically that the LE will not move further from the loop start). What would stop, for example, constant island pass from putting a constant pool into the loop?

And as Sjoerd says, I found out that DLS doesn't really do anything and can be ignored (all the "smarts" are in the LE). WLS obviously does more, and a DLSTP is important for tail predication. But if the count is already in lr, we needn't emit the DLS.

I will try to reorder the final passes. I hope that I can change the size of the pseudo instructions to be pessimistically big enough to be a cmp and br. I imagine that TTI will have to try to calculate the size, or at least the amount of live variables, so that these loops don't cause unnecessary register pressure and actually slow things down because of spills.

I don't know of any reason why this way round wouldn't work, so long as we tell constant island pass that the size of the instructions is large enough. It might be a little inefficient at times? I think the version of this on the branch put the handling _into_ constant island pass, but that version worked a little differently.

This looks good to me. I.e., the approach of expanding the pseudos as late as possible, and the availability of size estimation in TTI for these pseudos, seems the right approach to me. I hope that when overestimate the size, if necessary, that this will be always be safe.

This looks pretty neat to me. I have some comments that could be done in followups (as in, if its broken, we can fix it later).

It may be better to put this into constant island pass instead, to use its iterative nature and have it remain as the only place that deals with sizes.

Can you put a nice long comment into somewhere like the start of ARMFinalizeHardwareLoops explaining things like what the pseudos do and why they are needed (they are for registry allocation, essentially?), and what assumptions this is making about them remaining in specific blocks.

12 bytes for something that should usually take 4 seems overly pessimistic to me, and may pessimise other optimisations. Can we get this smaller somehow? I would expect the fallback to probably be (subs; bne) ideally.

I will add some comments, but I really don't think this belongs in constant islands. This doesn't have to worry about iterative changes, the loop size may only vary by 8 bytes, which is nothing compared to the 4KB that we need to concern ourselves with. Plus this is a very specific pass, especially once we start having to handle the tail predicated loops!

Sure, I'll add a comment. To summarise though, its because the decremented value of LR has probably been stored to the stack - but this decrement isn't really going to happen until the end of the block, where we can't spill. If we want the value of LR to be on the stack, we'd have to perform a manual sub. Added to this that we're probably reloading LR at the bottom of the loop for LE, we'd have to either perform an add before LE or use the other form of LE that doesn't perform the decrement.

I will add some comments, but I really don't think this belongs in constant islands. This doesn't have to worry about iterative changes, the loop size may only vary by 8 bytes, which is nothing compared to the 4KB that we need to concern ourselves with. Plus this is a very specific pass, especially once we start having to handle the tail predicated loops!

Its not really about the 4KB range of a LE instruction, I agree that's not super important, but the much smaller range of a cbz for example. If we are over-estimating the size of the loop in constant island pass, we may loose out on other optimisations we would otherwise have performed.