With some testing in different configurations, I'm seeing large variations in stack frame size, up to 1500 bytes for what should need around 300 bytes at most. I also checked the reference implementation, which is essentially the same code but also comes with some test and benchmarking infrastructure.

It seems that recent compiler versions on at least arm, arm64 and powerpc have a partial fix for this problem, by enabling "-fsched-pressure", but even with that fix they suffer from the issue to a certain degree. Some testing on arm64 shows that the time needed to hash a given amount of data is roughly proportional to the stack frame size here. That makes sense given that the wp512 implementation is doing lots of loads for table lookups, and the overly large stack frame is a result of doing a lot more loads and stores for spilled registers (as seen from inspecting the object code).

Trying the same test for serpent-generic, the picture is a bit different: while -fno-schedule-insns is generally better here than the default, -fsched-pressure wins overall, so I picked that instead.
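
For illustration, a per-object override in kbuild could look roughly like the fragment below. This is a sketch and not necessarily the exact hunk: it assumes the object is named serpent_generic.o, and it relies on the standard CFLAGS_<object>.o variable together with the cc-option helper, so that compilers lacking the flag simply build without it.

```make
# Sketch only: the object name serpent_generic.o is an assumption.
# cc-option expands to the flag only if the compiler accepts it.
CFLAGS_serpent_generic.o := $(call cc-option,-fsched-pressure)
```

Limiting the flag to a single object keeps the scheduling change from affecting the rest of the build.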

I did not do any runtime tests with serpent, so it is possible that stack frame size does not directly correlate with runtime performance here and the change actually makes things worse, but it's more likely to help, and the reduced stack frame size is probably enough reason to apply the patch, especially given that the crypto code is often used in deep call chains.