Py3k Unified Numeric Hash Proposal

There has been a recent and interesting set of discussions on python-dev
(Decimal <-> float comparisons)
about what the best behavior for numeric type interoperability would be. The
most prominent “mistake” in the current implementation is that certain float
and int/long values compare equal, and certain Decimal and int/long values
compare equal, but all float and Decimal comparison operations
raise TypeError. Other operations between float and Decimal also
raise TypeError. Python 2.x behavior is such that comparison operations
between float and Decimal return nonsense results and other operations
raise TypeError.

Guido recently pronounced (Mixing float and Decimal) that he’d like to
consider changing the behavior to match the principle of least surprise;
all operations for all of the numeric types should return correct results.
One of the most difficult problems to solve with such a unification is the
hash invariant:

for all a and b such that a == b: hash(a) == hash(b).

While this is relatively simple to implement for the integer cases, it’s
much trickier to do efficiently for Decimal and float (and Fraction!) because
Decimals are base 10 and floats are base 2.
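The invariant is easy to check from the interpreter; this is the behavior
Python 3 ended up with for all of the built-in numeric types:

```python
from decimal import Decimal
from fractions import Fraction

# Equal values of different numeric types must hash identically:
values = [2.5, Fraction(5, 2), Decimal("2.5")]
assert all(a == b for a in values for b in values)
assert len({hash(v) for v in values}) == 1
```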

Note that I started writing this post yesterday after studying version 3
of the patch. I have altered the inline quotes to reflect version 4,
which contains vastly improved comments that make most of my post redundant.
However, it may still help in some way because it is an independent
explanation and it provides some Python code.

Mark Dickinson proposed a very clever algorithm with an efficient
implementation in issue8188, which he summarized as follows
in the comments of the patch:

For numeric types, the hash of a number x is based on the reduction
of x modulo the prime P = 2**_PyHASH_BITS - 1. It’s designed so that
hash(x) == hash(y) whenever x and y are numerically equal, even if
x and y have different types.

A quick summary of the hashing strategy:

(1) First define the ‘reduction of x modulo P’ for any rational
number x; this is a standard extension of the usual notion of
reduction modulo P for integers. If x == p/q (written in lowest
terms), the reduction is interpreted as the reduction of p times
the inverse of the reduction of q, all modulo P; if q is exactly
divisible by P then define the reduction to be infinity. So we’ve
got a well-defined map

reduce : { rational numbers } -> { 0, 1, 2, ..., P-1, infinity }.

(2) Now for a rational number x, define hash(x) by:

reduce(x) if x >= 0
-reduce(-x) if x < 0

If the result of the reduction is infinity (this is impossible for
integers, floats and Decimals) then use the predefined hash value
_PyHASH_INF instead. _PyHASH_INF, _PyHASH_NINF and _PyHASH_NAN are also
used for the hashes of float and Decimal infinities and nans.
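The two steps above can be sketched directly in Python. The names
`reduce_mod_p` and `rational_hash` are mine, and the modular inverse is
taken via Fermat’s little theorem, which works precisely because P is prime:

```python
import sys
from fractions import Fraction

P = sys.hash_info.modulus    # 2**61 - 1 on 64-bit builds
HASH_INF = sys.hash_info.inf

def reduce_mod_p(x):
    """Step (1): reduction of a rational x modulo P (None = infinity)."""
    p, q = abs(x.numerator), x.denominator
    if q % P == 0:
        return None  # denominator divisible by P: defined as infinity
    # P is prime, so the inverse of q modulo P is q**(P-2) mod P (Fermat).
    return p % P * pow(q, P - 2, P) % P

def rational_hash(x):
    """Step (2): hash(x) = reduce(x) for x >= 0, -reduce(-x) otherwise."""
    r = reduce_mod_p(x)
    if r is None:
        return HASH_INF if x >= 0 else -HASH_INF
    h = r if x >= 0 else -r
    return -2 if h == -1 else h  # -1 is reserved by CPython's C API

for x in (Fraction(5, 2), Fraction(-5, 2), Fraction(1, 3), Fraction(1, P)):
    assert rational_hash(x) == hash(x)
```

The modular exponentiation in `reduce_mod_p` is expensive; the tricks below
are what let the real implementation avoid it for ints, floats and Decimals.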

A selling point for the above strategy is that it makes it possible
to compute hashes of decimal and binary floating-point numbers
efficiently, even if the exponent of the binary or decimal number
is large. The key point is that

reduce(x * y) == reduce(x) * reduce(y) (modulo _PyHASH_MASK)

provided that {reduce(x), reduce(y)} != {0, infinity}. The reduction of a
binary or decimal float is never infinity, since the denominator is a power
of 2 (for binary) or a divisor of a power of 10 (for decimal). So we have,
for nonnegative x,

reduce(x * 2**e) == reduce(x) * reduce(2**e) (modulo _PyHASH_MASK)
reduce(x * 10**e) == reduce(x) * reduce(10**e) (modulo _PyHASH_MASK)

and reduce(10**e) can be computed efficiently by the usual modular
exponentiation algorithm. For reduce(2**e) it’s even better: since
P is of the form 2**n-1, reduce(2**e) is 2**(e mod n), and multiplication
by 2**(e mod n) modulo 2**n-1 just amounts to a rotation of bits.

The choices of P for his implementation are (2**31)-1 for 32-bit platforms and
(2**61)-1 for 64-bit platforms. These numbers are interesting because they are
the eighth and ninth Mersenne prime
numbers. Primality is in fact essential here: because P is prime, every
denominator q not divisible by P has a multiplicative inverse modulo P,
which is what makes reduce() well defined for rationals (and it’s
conventional for a hash modulus to be prime anyway). Another very important
feature of these numbers is that P+1 is a power of two.

One thing that wasn’t immediately obvious to me was how to define modulus of
a (rational) number f such that 0 < f < 1. We know from the above that in
the floating point case we can break f into its mantissa and exponent:

reduce(m * (2**e)) == reduce(reduce(m) * reduce(2**e))

but that leaves the cases where 0 < 2**e < 1. Well, because we are working
modulo P, we know that P+1 reduces to 1 (the multiplicative identity), so
we can find some number n such that ((P+1)**n) * (2**e) is an integer; P+1
is a power of two, so a large enough n clears the denominator. We also know
that ((P+1)**n) * (2**e) mod P must be non-zero, since it is a power of two
and P is an odd prime.
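A tiny sketch of that argument for e = -1, assuming the 64-bit modulus
(`pow` with a negative exponent needs Python 3.8+):

```python
P = (1 << 61) - 1  # assumes a 64-bit build; P + 1 == 2**61

# P + 1 reduces to 1 modulo P, so (P + 1) // 2 behaves as the
# multiplicative inverse of 2:
inv2 = (P + 1) // 2
assert (2 * inv2) % P == 1

# which is exactly the value reduce(2**-1) must take:
assert inv2 == pow(2, -1, P)  # modular inverse, Python 3.8+
```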

We can demonstrate that computing reduce(x) where x = 2**e is quite a
trivial task for a typical CPU as follows (k is log2(P+1), which is 61 or
31). All of the following expressions mod P are equivalent to x mod P.
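Since the original expressions are not reproduced here, this sketch shows
the equivalence being pointed at: for P = 2**k - 1, reducing 2**e modulo P
is just a shift by e mod k (`reduce_pow2` is a name of my choosing, and
negative exponents need Python 3.8+ for the `pow` comparison):

```python
P = (1 << 61) - 1
k = 61  # log2(P + 1)

def reduce_pow2(e):
    # 2**k == P + 1 == 1 (mod P), so 2**e == 2**(e % k) (mod P):
    # no modular exponentiation needed, just a shift.
    return 1 << (e % k)

# Holds for huge and even negative exponents (Python's % keeps e % k >= 0):
for e in (0, 1, 60, 61, 100, 10**9, -5):
    assert reduce_pow2(e) == pow(2, e, P)
```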

You might notice the strange intentional mapping of -1 to -2, the reason for
this is simply that the convention of Python’s C API is such that return
values of -1 mean that an exception may have occurred (and a global variable
must be checked). If -1 is never returned on success then there are no
false positives so the general case is faster. Essentially Python is trading
this known worst case for a potential hash collision, which is probably the
right call.
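A minimal illustration of that convention (`fix_hash` is a hypothetical
stand-in for the final step of the C routine):

```python
def fix_hash(h):
    # CPython's C hash functions return -1 to mean "an error may have
    # occurred", so a legitimately computed hash of -1 becomes -2.
    return -2 if h == -1 else h

assert fix_hash(-1) == -2
assert fix_hash(7) == 7
assert hash(-1) == -2  # the visible consequence at the Python level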

If you read the actual C implementation there are a few additional math
tricks at play, the most important of which appear in the implementation of
long_hash from longobject.c.

In order to understand this better we’ll translate this to Python first, but
to do that we need to understand the layout of integers in py3k. In py3k
integers are represented as a sequence of zero or more digits, where each
digit is sys.int_info.bits_per_digit bits wide (so it holds a value less
than 2**sys.int_info.bits_per_digit), and the least significant digit comes
first in the array. I’m not aware of any Python function
to see integers at this level so we’ll craft our own way to “disassemble” an
integer in the way that the C implementation will see it. Instead of tracking
the sign and size as one integer we will track the sign on its own and use
the length of the list to track size.
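Here is a sketch of how that translation might look (`disassemble` and
`long_hash` are names of my choosing). It uses sys.hash_info.modulus and
sys.int_info.bits_per_digit so it works on both 32- and 64-bit builds:

```python
import sys

SHIFT = sys.int_info.bits_per_digit  # 30 on typical 64-bit builds
MASK = (1 << SHIFT) - 1
P = sys.hash_info.modulus            # 2**61 - 1 on 64-bit builds
P_BITS = P.bit_length()              # 61 on 64-bit builds

def disassemble(n):
    """Break n into (sign, digits) the way py3k stores it: least
    significant digit first, each digit SHIFT bits wide."""
    sign = -1 if n < 0 else 1
    n = abs(n)
    digits = []
    while n:
        digits.append(n & MASK)
        n >>= SHIFT
    return sign, digits

def long_hash(n):
    sign, digits = disassemble(n)
    h = 0
    # Horner's rule, most significant digit first.  Multiplying by
    # 2**SHIFT modulo P = 2**P_BITS - 1 is just a left rotation of
    # the P_BITS-bit value, since 2**P_BITS == 1 (mod P).
    for d in reversed(digits):
        h = ((h << SHIFT) & P) | (h >> (P_BITS - SHIFT))
        h = (h + d) % P
    h *= sign
    return -2 if h == -1 else h
```

The `(h + d) % P` step replaces the C code's single conditional subtraction,
and the rotation is the bit trick described earlier for reduce(2**e).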

I’m definitely not Tim Peters, or even a mathematician, but I found this
problem interesting enough to dive into, especially because Guido didn’t
find it obvious either (Objects/longobject.c). I think I’ve covered it in
sufficient depth to believe
that it works and the patch is good, but if I’m missing something please let
me know!