Try the spark-hash implementation (https://github.com/mrsqueeze/spark-hash). Quoting from the README, "this implementation was largely based on the algorithm described in chapter 3 of Mining of Massive Datasets", which has a great description of LSH and minhashing...

a. It's a left shift (https://docs.python.org/2/reference/expressions.html#shifting-operations): it shifts the bits one position to the left.
b. Note that in Python ^ is not "to the power of" but "bitwise XOR".
c. As the comment states, it defines the "number of bits per signature" as 2**10 → 1024.
d. The lines calculate...
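To make points a to c concrete, here is a minimal sketch of the three operators side by side. The variable names are made up for illustration and do not come from the code being discussed.

```python
# Left shift: each shift by one position doubles the value.
one_shift = 1 << 1        # binary 1 becomes binary 10
ten_shifts = 1 << 10      # same value as 2**10

# ^ is bitwise XOR, not exponentiation.
xor_result = 2 ^ 10       # 0b0010 XOR 0b1010 == 0b1000

# ** is the actual power operator.
power = 2 ** 10

print(one_shift, ten_shifts, xor_result, power)
```

So `2 ^ 10` and `2 ** 10` give very different results, which is an easy mistake to make when reading hashing code.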

It's not wrong: LSHForest implements ANN (approximate nearest neighbor) search, and that is probably the difference we need to take into consideration. The ANN results are not the true nearest neighbors, but an approximation of what the nearest neighbors should be. For example, a 2-nearest neighbor result looks like: from sklearn.neighbors import NearestNeighbors...
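The difference between exact and approximate results can be sketched in plain Python. This is only an illustration under assumptions: it does not use LSHForest or scikit-learn, the helper names (`exact_knn`, `approx_knn`, `make_hash`) are hypothetical, and the hash is a simple random-hyperplane scheme for cosine distance rather than whatever LSHForest does internally.

```python
import math
import random

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))

def exact_knn(points, query, k):
    """Brute force: rank every point by its distance to the query."""
    order = sorted(range(len(points)),
                   key=lambda i: cosine_distance(points[i], query))
    return order[:k]

def make_hash(dim, n_bits, rng):
    """Random-hyperplane hash: one sign bit per hyperplane (hypothetical helper)."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def h(p):
        return tuple(sum(a * b for a, b in zip(plane, p)) >= 0 for plane in planes)
    return h

def approx_knn(points, query, k, h):
    """Approximate: only rank the points that fall in the query's bucket."""
    key = h(query)
    bucket = [i for i, p in enumerate(points) if h(p) == key]
    bucket.sort(key=lambda i: cosine_distance(points[i], query))
    return bucket[:k]  # may hold fewer than k points, or miss a true neighbor

rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
query = (0.1, 0.2)
h = make_hash(dim=2, n_bits=3, rng=rng)

exact = exact_knn(points, query, 2)
approx = approx_knn(points, query, 2, h)
print("exact: ", exact)
print("approx:", approx)
```

The approximate search never looks outside the query's bucket, so it can return a different (and never closer) set of neighbors than the brute-force search. That is expected behavior for ANN, not a bug.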

All of the hash functions are in fact used. This makes more sense if you remember that, for example, in the section "Bit sampling for Hamming distance", an individual hash function might simply return a single bit. In fact, another example of an LSH hash function is to consider a...
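A minimal sketch of the bit-sampling idea, assuming inputs are fixed-length tuples of 0/1 bits. The names `make_bit_sampler` and `make_signature` are hypothetical. Each individual hash function returns one bit, and several of them are concatenated to form the bucket key, which is why all of them end up being used.

```python
import random

def make_bit_sampler(i):
    """A single LSH function for Hamming distance: return bit i of the input."""
    return lambda bits: bits[i]

def make_signature(dim, k, rng):
    """Concatenate k randomly chosen single-bit hash functions into one bucket key."""
    hashes = [make_bit_sampler(i) for i in rng.sample(range(dim), k)]
    return lambda bits: tuple(h(bits) for h in hashes)

rng = random.Random(42)
g = make_signature(dim=8, k=3, rng=rng)

a = (1, 0, 1, 1, 0, 0, 1, 0)
b = (1, 0, 1, 1, 0, 0, 1, 1)   # differs from a only in the last bit

key_a = g(a)
key_b = g(b)
print(key_a, key_b)
```

Two inputs land in the same bucket only if every sampled bit agrees, so a single one-bit function is a very weak filter on its own, while the concatenation of k of them is selective enough to be useful.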