On Mon, Jan 23, 2012 at 4:39 PM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
> Running (part of (*)) the test suite with debugging enabled on a 64-bit
> machine shows that slot collisions are much more frequent than
> hash collisions, which only account for less than 0.01% of all
> collisions.
Even 1 in 10,000 seems pretty high, though I suppose it is a result of
non-random input. (For a smalldict with 8 == 2^3 slots, on a 64-bit
machine, true hash collisions "should" only account for 1 in 2^61 slot
collisions.)
> It also shows that slot collisions in the low 1-10 range are
> most frequent, with very few instances of a dict lookup
> reaching 20 slot collisions (less than 0.0002% of all
> collisions).
Thus the argument that collisions > N implies (possibly malicious)
data that really needs a different hash -- and that this dict instance
in particular should take the hit to use an alternative hash. (Do
note that this alternative hash could be stored in the hash member of
the PyDictEntry; if anything actually *equal* to the key comes along,
it will have gone through just as many collisions, and therefore also
have been rehashed.)
> The great number of cases with 1 or 2 slot collisions surprised
> me. It seems that there's potential for improvement of
> the perturbation formula left.
In retrospect, this makes sense.
for (perturb = hash; ; perturb >>= PERTURB_SHIFT) {
i = (i << 2) + i + perturb + 1;
If two objects collided then they have the same last few last few bits
in their hashes -- which means they also have the same last few bits
in their initial perturb. And since the first probe is to slot 6i+1,
it funnels down to only even consider half the slots until the second
probe.
Also note that this explains why Randomization could make the Django
tests fail, even though 64-bit users haven't complained. The initial
hash(&mask) is the same, and the first probe is the same, and (for a
small enough dict) so are the next several. In a dict with 2^12
slots, the first 6 tries will be the same ... so I doubt the test
cases have sufficiently large amounts of sufficiently unlucky data to
notice very often -- unless the hash itself is changed, as in the
patch.