On Thu, Apr 19, 2012 at 10:02 AM, H. Peter Anvin <hpa [at] zytor> wrote: > > ... and I would be *even happier* with an O(1) hash (which pretty much > *have* to be constructed at compile time.)

Taking relocations into account might be interesting for hashing. I guess you could hash the relative offsets, though.

But yeah, a hash might be the way to go, and once you generate the tables at compile-time, why not go all the way? It doesn't need to be some complex perfect hash, it should be fairly straightforward to just size the hash right and use some simple linear probing model for collisions or whatever.

On Thu, Apr 19, 2012 at 10:27:09AM -0700, Linus Torvalds wrote: > But yeah, a hash might be the way to go, and once you generate the > tables at compile-time, why not go all the way? It doesn't need to be > some complex perfect hash, it should be fairly straightforward to just > size the hash right and use some simple linear probing model for > collisions or whatever.

Yeah, simplicity is the key here. I don't think we're getting that many early boot exceptions due to rdmsr or whatever to warrant adding a bunch of code.

OTOH, if we can share early and normal exception handling lookup code, then a perfect hash would make sense as those exceptions would pile up.

On 04/19/2012 10:02 AM, H. Peter Anvin wrote: > On 04/19/2012 02:26 AM, Borislav Petkov wrote: >> >> Also, move the sorting of the main exception table earlier in the boot >> process now that we handle exceptions in the early IDT handler too. >> > > I would much rather use David Daney's patchset to sort the exception > table at compile time: > > https://lkml.org/lkml/2011/11/18/427 > > ... and I would be *even happier* with an O(1) hash (which pretty much > *have* to be constructed at compile time.) >

The sorting is easier infrastructure-wise, since it can be done in-place. The O(1) hash needs additional space for the hash table itself (the actual table can still be stashed where the compiler generates it.)

It generally needs ~3.2 bytes per hash table entry, rounded up to a power of two; the rounding up is for performance. The easiest way to do this is probably to let the linker create a zero-filled section of the proper size (since the linker knows the final size of the __ex_table section, and the linker script can do at least a modicum of arithmetic) and then use a tool to patch in the hash table auxiliary data.
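As a rough user-space sketch of the sizing arithmetic (the one-32-bit-slot-per-bucket layout and the density constant are assumptions for illustration; they don't try to reproduce the ~3.2 figure exactly):

```c
#include <stdint.h>

/* Round a count up to the next power of two, as the sizing above suggests. */
static uint32_t next_pow2(uint32_t n)
{
	uint32_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Assuming one 32-bit index per slot and some headroom over the entry
 * count, estimate the reserved size for n extable entries.  The
 * constants here are invented, not taken from any real implementation.
 */
static uint32_t ex_hash_bytes(uint32_t n)
{
	uint32_t slots = next_pow2(n + n / 4);	/* ~1.25n slots, rounded up */

	return slots * (uint32_t)sizeof(uint32_t);
}
```

The rounding to a power of two is what lets the lookup use a mask instead of a modulo.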

-hpa

P.S. Another modification which was talked about in the past, and for which there were even patches, was to make the exception table entries relative to the start of the kernel, so we don't need two relocations per entry on x86-32, nor twice the amount of data we actually need on x86-64. As I recall we tried those patches and there was some bug that never got resolved.

This is obviously a precondition for doing the O(1) hash, since the O(1) hash can't be relocated once generated.

On 04/19/2012 10:38 AM, Borislav Petkov wrote: > > Yeah, simplicity is the key here. I don't think we're getting that many > early boot exceptions due to rdmsr or whatever to warrant adding a bunch > of code. > > OTOH, if we can share early and normal exception handling lookup code, then a > perfect hash would make sense as those exceptions would pile up. >

Obviously we should use the hash after the main kernel is up, too. Anything else would be silly.

I would argue that the O(1) hash makes things simpler as there is no need to deal with collisions at all. The one advantage of "plain" hashes is that they can be modified at runtime, which would be of some interest if we create one single hash table that includes modules.

On Thu, Apr 19, 2012 at 10:38 AM, Borislav Petkov <bp [at] amd64> wrote: > > OTOH, if we can share early and normal exception handling lookup code, then a > perfect hash would make sense as those exceptions would pile up.

Oh, any of this only makes sense if we can share it with the runtime exception lookup.

They are *seldom* hugely performance-critical, but there are some unusual loads where you do get a fair number of exceptions.

On Thu, Apr 19, 2012 at 10:59 AM, H. Peter Anvin <hpa [at] zytor> wrote: > > I would argue that the O(1) hash makes things simpler as there is no > need to deal with collisions at all.

Most of the O(1) hashes I have seen more than made up for the trivial complexity of a few linear lookups by making the hash function way more complicated.

A linear probe with a step of one really is pretty simple. Sure, you might want to make the initial hash "good enough" to not often hit the probing code, but doing a few linear probes is cheap.

In contrast, the perfect linear hashes do crazy things like having table lookups *JUST TO COMPUTE THE HASH*.

Which is f*cking stupid, really. They'll miss in the cache just at hash compute time, never mind at hash lookup. The table-driven versions look beautiful in microbenchmarks that have the tables in the L1 cache, but for something like the exception handling, I can guarantee that *nothing* is in L1, and probably not even L2.

So what you want is:

 - no table lookups for hashing
 - simple code (ie a normal "a multiply and a shift/mask or two") to keep the I$ footprint down too
 - you *will* take a cache miss on the actual hash table lookup, that cannot be avoided, but linear probing at least hopefully keeps it to that single cache miss even if you have to do a probe or two.

Remember: this is very much a "cold-cache behavior matters" case. We would never ever call this in a loop, at most we have loads that get a fair amount of exceptions (but will go through the exception code, so the L1 is probably blown even then).
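A user-space sketch of that shape — a multiply and a shift for the hash, no lookup tables, step-of-one probing on collisions. The multiplier, sizes and layout are invented for illustration, not anything from a real implementation:

```c
#include <stdint.h>

#define EX_SLOTS 256			/* power of two, ~2x the entry count */

static uint32_t ex_addr[64];		/* the "real" extable: fault addresses */
static uint32_t ex_slot[EX_SLOTS];	/* 1-based index into ex_addr; 0 = empty */

static uint32_t ex_hash(uint32_t addr)
{
	/* a multiply and a shift -- no lookup tables to miss on */
	return (addr * 0x9e3779b9u) >> 16;
}

/* Build time (ideally: compile time): insert entry idx for addr. */
static void ex_insert(uint32_t addr, uint32_t idx)
{
	uint32_t s = ex_hash(addr);

	ex_addr[idx] = addr;
	while (ex_slot[s & (EX_SLOTS - 1)] != 0)
		s++;			/* collision: step-of-one probe */
	ex_slot[s & (EX_SLOTS - 1)] = idx + 1;
}

/* Fault time: find the entry index for addr, or -1 if none. */
static int ex_lookup(uint32_t addr)
{
	uint32_t s = ex_hash(addr);
	uint32_t v;

	while ((v = ex_slot[s & (EX_SLOTS - 1)]) != 0) {
		if (ex_addr[v - 1] == addr)
			return (int)(v - 1);
		s++;			/* adjacent slot, often the same cache line */
	}
	return -1;
}
```

With 4-byte slots, the step-of-one probe usually stays within the cache line the initial hash already pulled in, which is the whole point in the cold-cache case.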

Either way I suggest picking up David's presorting patchset since it is already done and use its infrastructure for any further improvements.

As far as a linear probe goes, you get an average of n lookups with a packing density of 1-1/n, so you are right; a linear probe with a density of, say, 1/2 is probably best.
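For reference, the classical approximation for a successful lookup under step-of-one linear probing at load factor a is (1 + 1/(1-a))/2 probes, which is consistent with both figures above:

```c
/*
 * Knuth's approximation for the average number of probes in a
 * *successful* linear-probe lookup at load factor a:
 * roughly (1 + 1/(1 - a)) / 2.  At a = 1/2 that is 1.5 probes;
 * at a = 1 - 1/n it degenerates to about n/2, matching the rough
 * "density 1-1/n costs ~n lookups" figure.
 */
static double lp_avg_probes(double a)
{
	return (1.0 + 1.0 / (1.0 - a)) / 2.0;
}
```

So halving the density buys you a near-constant probe count for a modest space cost.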

Linus Torvalds <torvalds [at] linux-foundation> wrote:

>On Thu, Apr 19, 2012 at 10:59 AM, H. Peter Anvin <hpa [at] zytor> wrote: >> >> I would argue that the O(1) hash makes things simpler as there is no >> need to deal with collisions at all. > >[snip]

On 04/19/2012 11:55 AM, H. Peter Anvin wrote: > Either way I suggest picking up David's presorting patchset since it is already done and use its infrastructure for any further improvements. >

It does have the advantage of already being implemented. There was a little feedback on the kbuild portions of the patch.

If you would like, I will send an updated version of the patch.

> As far as a linear probe you get an average of n lookups with a packing density of 1-1/n so you are right; a linear probe with a density of say 1/2 is probably best. >

I usually see exception table sizes on the order of 2^10 entries, so I have to wonder how much you really gain from an O(1) implementation.

David Daney


On 04/19/2012 01:17 PM, David Daney wrote: > On 04/19/2012 11:55 AM, H. Peter Anvin wrote: >> Either way I suggest picking up David's presorting patchset since it >> is already done and use its infrastructure for any further improvements. > > It does have the advantage of already being implemented. There was a > little feedback on the kbuild portions of the patch. > > If you would like, I will send an updated version of the patch.

Please. It gets us 90% of the way, and we need the infrastructure anyway to do any further work.

>> As far as a linear probe you get an average of n lookups with a >> packing density of 1-1/n so you are right; a linear probe with a >> density of say 1/2 is probably best. >> > > I usually see exception table sizes on the order of 2^10 entries, so I > have to wonder how much you really gain from an O(1) implementation.

Well, for either variant of hash table you end up with ~2 serial memory references as opposed to ~11. No idea if there are workloads where this actually matters.
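The back-of-the-envelope arithmetic behind ~11, assuming ~2^10 sorted entries (the helper below is purely illustrative):

```c
/*
 * Worst-case number of elements a binary search touches over n sorted
 * entries: floor(log2(n)) + 1, i.e. the number of halvings until the
 * range is empty.  Each touch is a potentially-cold memory reference,
 * and they are serial -- each depends on the previous comparison.
 */
static unsigned int bsearch_probes(unsigned int n)
{
	unsigned int p = 0;

	while (n) {
		p++;
		n >>= 1;
	}
	return p;
}
```

For n = 1024 that is 11 dependent references, versus ~2 for the hash (the bucket, then the entry it points at).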

On 04/19/2012 01:20 PM, H. Peter Anvin wrote: > On 04/19/2012 01:17 PM, David Daney wrote: >> On 04/19/2012 11:55 AM, H. Peter Anvin wrote: >>> Either way I suggest picking up David's presorting patchset since it >>> is already done and use its infrastructure for any further improvements. >> >> It does have the advantage of already being implemented. There was a >> little feedback on the kbuild portions of the patch. >> >> If you would like, I will send an updated version of the patch. > > Please. It gets us 90% of the way, and we need the infrastructure > anyway to do any further work. >

Linus, it's not really an x86 patchset, but would you object if I queue up David's updated patchset in -tip? I'll try to get an Ack from Ralf for the MIPS portion.

On 04/19/2012 01:26 PM, H. Peter Anvin wrote: > Linus, it's not really an x86 patchset, but would you object if I queue > up David's updated patchset in -tip? I'll try to get an Ack from Ralf > for the MIPS portion. >

I am looking at the patch right now to see if I want to revise it in any way. Very Soon Now I will either send a new version, or indicate that I think the existing version is good enough.

On Thu, Apr 19, 2012 at 1:26 PM, H. Peter Anvin <hpa [at] zytor> wrote: > > Linus, it's not really an x86 patchset, but would you object if I queue > up David's updated patchset in -tip? I'll try to get an Ack from Ralf > for the MIPS portion.

On 04/19/2012 01:17 PM, David Daney wrote: > > I usually see exception table sizes on the order of 2^10 entries, so I > have to wonder how much you really gain from an O(1) implementation. >

One thing that probably would give more of a boost is to use a rbtree or similar structure to figure out *which* extable (if any) we should be looking at; right now it looks like we linearly walk the modules, and don't even look to see if we are inside that module before we do a bsearch in that module's extable...

On Thu, Apr 19, 2012 at 3:16 PM, H. Peter Anvin <hpa [at] zytor> wrote: > One thing that probably would give more of a boost is to use a rbtree or > similar structure to figure out *which* extable (if any) we should be > looking at; right now it looks like we linearly walk the modules, and > don't even look to see if we are inside that module before we do a > bsearch in that module's extable...

How many entries are in the extable for a typical module? Perhaps it might make sense to bundle them all into one sorted combined table? Of course you would have to have a way to squeeze them back out of the combined table at module unload time.

This moves the cost to module load/unload time ... which is hopefully rare compared to table lookup.
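A sketch of what such a combined sorted table could look like (all names and sizes invented; overflow checks and anything kernel-specific are omitted for brevity):

```c
#include <stdint.h>
#include <string.h>

/*
 * One combined, sorted exception table shared by core kernel and
 * modules.  Insertion and removal pay an O(n) memmove at module
 * load/unload; the fault-time lookup stays a single binary search.
 */
struct ex_entry {
	uintptr_t addr;		/* faulting instruction */
	uintptr_t fixup;	/* where to resume */
};

#define EX_MAX 1024
static struct ex_entry ex_table[EX_MAX];
static unsigned int ex_count;

/* First index whose addr is >= key. */
static unsigned int ex_lower_bound(uintptr_t key)
{
	unsigned int lo = 0, hi = ex_count;

	while (lo < hi) {
		unsigned int mid = lo + (hi - lo) / 2;

		if (ex_table[mid].addr < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Module load: splice one entry in, keeping the table sorted. */
static void ex_add(uintptr_t addr, uintptr_t fixup)
{
	unsigned int i = ex_lower_bound(addr);

	memmove(&ex_table[i + 1], &ex_table[i],
		(ex_count - i) * sizeof(ex_table[0]));
	ex_table[i].addr = addr;
	ex_table[i].fixup = fixup;
	ex_count++;
}

/* Module unload: squeeze out every entry in [start, end). */
static void ex_remove_range(uintptr_t start, uintptr_t end)
{
	unsigned int lo = ex_lower_bound(start);
	unsigned int hi = ex_lower_bound(end);

	memmove(&ex_table[lo], &ex_table[hi],
		(ex_count - hi) * sizeof(ex_table[0]));
	ex_count -= hi - lo;
}

/* Fault path: find the fixup for addr, 0 if there is none. */
static uintptr_t ex_search(uintptr_t addr)
{
	unsigned int i = ex_lower_bound(addr);

	if (i < ex_count && ex_table[i].addr == addr)
		return ex_table[i].fixup;
	return 0;
}
```

Unload removes by address range, which is exactly the information a module's load address and size give you.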

On Thu, Apr 19, 2012 at 3:47 PM, Tony Luck <tony.luck [at] gmail> wrote: > > How many entries are in the extable for a typical module? Perhaps it might > make sense to bundle them all into one sorted combined table? Of course > you would have to have a way to squeeze them back out of the combined > table at module unload time. > > This moves the cost to module load/unload time ... which is hopefully rare > compared to table lookup.

Using a traditional chained hash-table might be very amenable to this kind of situation. It's much easier to populate and doesn't have the size issues. And realistically, we can probably size the hash table for just the built-in kernel, because modules seldom have nearly as many exception table entries, so adding them later to a fixed-size table likely won't be too painful.
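A minimal sketch of that chained variant (user-space and malloc-based, so purely illustrative — a kernel version would allocate differently, and all names here are invented):

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Chained hash: a fixed bucket array sized for the built-in kernel,
 * with nodes chained per bucket so module entries can be added and
 * removed at load/unload time without resizing anything.
 */
#define EX_BUCKETS 256		/* power of two, sized for the core kernel */

struct ex_node {
	uintptr_t addr;
	uintptr_t fixup;
	struct ex_node *next;
};

static struct ex_node *ex_bucket[EX_BUCKETS];

static unsigned int ex_hash(uintptr_t addr)
{
	return (unsigned int)((addr * 0x9e3779b97f4a7c15ull) >> 32) &
	       (EX_BUCKETS - 1);
}

/* Add one entry (e.g. while loading a module); 0 on success. */
static int ex_add(uintptr_t addr, uintptr_t fixup)
{
	struct ex_node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	n->addr = addr;
	n->fixup = fixup;
	n->next = ex_bucket[ex_hash(addr)];
	ex_bucket[ex_hash(addr)] = n;
	return 0;
}

/* Fault path: fixup for addr, 0 if none. */
static uintptr_t ex_search(uintptr_t addr)
{
	struct ex_node *n;

	for (n = ex_bucket[ex_hash(addr)]; n; n = n->next)
		if (n->addr == addr)
			return n->fixup;
	return 0;
}

/* Module unload: drop every entry whose address falls in [start, end). */
static void ex_remove_range(uintptr_t start, uintptr_t end)
{
	unsigned int b;

	for (b = 0; b < EX_BUCKETS; b++) {
		struct ex_node **pp = &ex_bucket[b];

		while (*pp) {
			struct ex_node *n = *pp;

			if (n->addr >= start && n->addr < end) {
				*pp = n->next;
				free(n);
			} else {
				pp = &n->next;
			}
		}
	}
}
```

The trade-off versus open addressing: one extra pointer chase per lookup, in exchange for trivially cheap insertion and removal.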

In fact, doing an "objdump" on the few modules I have, I didn't find a *single* exception table entry. Maybe I did something wrong? Isn't it the __ex_table in modules too?

(Admittedly, I avoid modules like the plague, so I don't tend to have very many modules, and the ones I do have tend to be pretty limited. So my module usage is absolutely not representative even if I think it should be ;^).

On 04/19/2012 03:58 PM, Linus Torvalds wrote: > > In fact, doing an "objdump" on the few modules I have, I didn't find a > *single* exception table entry. Maybe I did something wrong? Isn't it > the __ex_table in modules too? >

It is, but I suspect a lot of modules don't end up with any exception table entries, as the common forms of user space access are out of line.

I suspect that just checking whether we're inside a given module before searching its ex_table is probably good enough, but a global hash would be better. That being said, some distros are complaining about the amount of time it takes to load modules on boot...

On 04/19/2012 04:26 PM, Tony Luck wrote: > On Thu, Apr 19, 2012 at 4:10 PM, H. Peter Anvin <hpa [at] zytor> wrote: >> be better. That being said, some distros are complaining about the >> amount of time it takes to load modules on boot... > > That could be fixed in the installer for the distribution by rebuilding a kernel > with the target set of modules for the system built in. >

... assuming you have a well-defined "the system". It isn't a given that you will boot on the same hardware every time.


On Thu, Apr 19, 2012 at 11:11:00AM -0700, Linus Torvalds wrote: > > OTOH, if we can share early and normal exception handling lookup code, then a > > perfect hash would make sense as those exceptions would pile up. > > Oh, any of this only makes sense if we can share it with the runtime > exception lookup. > > They are *seldom* hugely performance-critical, but there are some > unusual loads where you do get a fair number of exceptions.

I was thinking, and this is probably purely hypothetical and tangential to the topic, but: what happens if we get an exception right in the middle, between using the early_idt_handlers and switching to the normal trap handlers in trap_init()?

write_idt_entry() is a simple memcpy, so what happens if we have a half-written descriptor? Won't we need some locking there to make the switch atomic?