The x86 FPU/NPX code isn't quite fit for the needs of modern
processors.
There is no support for AVX and later multimedia/vector extensions
yet, and it is impossible to use the FPU in the kernel, which would be
useful for IPSEC (the AES instructions, SHA is announced).

Quite true; I would say it is a remnant from the past, when FP
operations were not required in the kernel. The latest additions make
it valuable, though.

Thanks for looking into it.

This could certainly be fixed within the MD i386/amd64 code (there is
not much shared currently), adding even more MD code.

I've tried to adopt the "pcu" framework instead, which is already used
on arm, mips and ppc. This doesn't solve the problems mentioned above
automatically, but it reduces the amount of MD code massively.
The first patch (x86_use_pcu) implements this. It is well tested
for native code, on amd64 with and without xen, and on i386. There
might be flaws in the emulation code; I just haven't tested this yet.
The other patch (amd64_dynfpustate) moves the FPU state from the PCB
into a dynamically allocated object. This would allow the save area
to grow as needed by modern CPUs, without growing the PCB
for everyone.
This is strictly a POC - it builds only on amd64 and it contains some
hacks which break binary emulation. Nevertheless it works, and
hasn't done any harm to my test system yet.
Obviously, there might be performance implications, and possibly
problems under memory shortage.
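
To make the idea concrete, here is a minimal userland sketch of a PCB
that only carries a pointer to a separately allocated save area. It is
not taken from the patch: every name in it (fpu_save_area, pcb_sketch,
pcb_fpu_ensure, pcb_fpu_free) is hypothetical, and calloc() merely
stands in for whatever kernel allocator would be used.

```c
#include <stddef.h>
#include <stdlib.h>

/* All names here are hypothetical; none are taken from the patch. */

/* Dynamically sized FPU save area, kept outside the PCB. */
struct fpu_save_area {
	size_t size;         /* bytes available in data */
	unsigned char *data; /* real code would need 64-byte alignment for XSAVE */
};

/* Stand-in for the PCB: it only carries a pointer, so it stays small. */
struct pcb_sketch {
	struct fpu_save_area *fpu; /* NULL until the FPU is first used */
};

/*
 * Allocate the save area on first use, sized by what the CPU reports.
 * calloc() stands in for a kernel allocator. Returns 0 on success,
 * -1 if memory is short.
 */
int
pcb_fpu_ensure(struct pcb_sketch *pcb, size_t cpu_save_size)
{
	struct fpu_save_area *a;

	if (pcb->fpu != NULL)
		return 0; /* already allocated, nothing to do */
	a = malloc(sizeof(*a));
	if (a == NULL)
		return -1;
	a->data = calloc(1, cpu_save_size);
	if (a->data == NULL) {
		free(a);
		return -1;
	}
	a->size = cpu_save_size;
	pcb->fpu = a;
	return 0;
}

/* Release the save area, e.g. when the owning thread exits. */
void
pcb_fpu_free(struct pcb_sketch *pcb)
{
	if (pcb->fpu == NULL)
		return;
	free(pcb->fpu->data);
	free(pcb->fpu);
	pcb->fpu = NULL;
}
```

With this layout a thread that never touches the FPU pays only for one
pointer, while the area itself can be 512 bytes for plain FXSAVE or
larger for XSAVE (for example 832 bytes with AVX state enabled). The
failure return of pcb_fpu_ensure() is exactly the memory-shortage case
mentioned above.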

I have two questions regarding dynfpustate:

1 - first, is it safe to have a sleepable allocation right in
fputrap(), especially under memory pressure?
2 - following 1), why not kmem_alloc the PCB FPU state within
cpu_attach? There are already allocations that happen there (in
particular: reserved pages for the MMU), and memory pressure is low at
this time (during boot; a hotplugged CPU would not be usable until
cpu_attach returns, so there is no risk of having a working CPU
without FPU state tracking). IMO it would make sense to do it there.
The FPU is not expected to change dynamically.
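
To illustrate what I mean, a rough userland sketch of the split:
allocate once at a point where sleeping is acceptable, and let the
trap path only check for the area. The names (fpu_ctx, fpu_ctx_attach,
fpu_ctx_trap_ready, fpu_ctx_detach) are made up for this example and
calloc() stands in for the kernel allocator; it does not try to settle
which object should own the state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names; a userland stand-in, not NetBSD code. */
struct fpu_ctx {
	void *save_area; /* NULL until attached */
	size_t size;     /* bytes in save_area */
};

/*
 * Runs where sleeping is acceptable (e.g. an attach-time hook).
 * In a kernel this is where a sleepable allocation such as
 * kmem_alloc(size, KM_SLEEP) could go; calloc() stands in here.
 * Returns 0 on success, -1 on allocation failure.
 */
int
fpu_ctx_attach(struct fpu_ctx *ctx, size_t size)
{
	ctx->save_area = calloc(1, size);
	if (ctx->save_area == NULL)
		return -1;
	ctx->size = size;
	return 0;
}

/* Trap path: never allocates, only checks that the area exists. */
bool
fpu_ctx_trap_ready(const struct fpu_ctx *ctx)
{
	return ctx->save_area != NULL;
}

/* Undo fpu_ctx_attach(). */
void
fpu_ctx_detach(struct fpu_ctx *ctx)
{
	free(ctx->save_area);
	ctx->save_area = NULL;
	ctx->size = 0;
}
```

The point of the split is that an allocation failure shows up at
attach time, where it can be handled, instead of in the middle of a
trap.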