CompSci 261, Winter 2018, Homework 2

In the deletion algorithm for linear probing, we need to
know whether an index $i$ appears between $h(k)$ and $j$, where
$h(k)$ is the starting index for key $k$ and all three of $i$,
$h(k)$, and $j$ belong to the same block of occupied cells. Give
pseudocode or a formula for performing this betweenness test,
that correctly handles the case when the block of cells wraps
around from the end of the hash table back to its start.

Recall that, for a linear probing hash table with load
factor at most $1/2$, the multiplicative Chernoff bound tells us
that the probability of seeing a block of exactly $B$ filled
cells in one particular location in the hash array is at most
$(e/4)^{B/2}$. Use this formula to give a bound on the expected
number of blocks of filled cells of this length throughout the
entire hash array. What is the largest value of $B$ for which
your bound on the expected number is at least $1$? (Hint: How
many different blocks of length $B$ are there?)

Solution: Let $n$ be the number of keys, so (with the given load factor) there are $2n$ cells in the hash table, and $2n$ places where a block of $B$ cells could start. By linearity of expectation, the expected number of filled cells of this length is at most $2n(e/4)^{B/2}$. To make this equal to 1, we need $B$ to be proportional to $\log n$; more precisely, $B=2\log_{4/e}(2n)$.

The version of cuckoo hashing that we described in class
uses two hash functions $h_1(k)$ and $h_2(k)$ that both
map a key $k$ to a position within a single array $A$ of
key/value pairs. Sometimes cuckoo hashing is instead implemented
with two different arrays $A_1$ and $A_2$, so that $h_1(k)$ maps
key $k$ to a position in array $A_1$ and $h_2(k)$ maps key $k$
to a position in array $A_2$. Recall that one way for cuckoo
hashing to fail is for three keys to all map to the same two
array cells.

Compute the probability that a given triple of keys all map
to the same two array cells, for both kinds of cuckoo hash
table, when there are $n$ keys and a total of $2n$ array cells.
(For the version of cuckoo hashing with two tables, you can
assume that each of the two arrays has $n$ cells.) For which of
these two versions of cuckoo hashing is there a greater
probability of this type of failure, or are they both equal in
this respect?

Solution: Let's call the three keys $x$, $y$, and $z$. With the single table of $2n$ cells, there is a $1/(2n)$ probability that $h_1(x)=h_2(x)$, and when this happens the probability that the four other hash values also land in the same place is $1/(2n)^4$. On the other hand, there is a $(2n-1)/(2n)$ probability that $h_1(x)\ne h_2(x)$, and when this happens each other key has a $2/(2n)^2$ probability of landing in the same two places (possibly with the two hash functions swapped, but this still counts as a collision). So the total probability of a three-way collision is
$$\frac{1}{32n^5}+\frac{2n-1}{8n^5}=\frac{2n+3}{8n^5}\approx \frac{1}{4n^4}.$$
With two tables of $n$ cells, collisions only happen when $h_1(x)=h_1(y)=h_1(z)$ and $h_2(x)=h_2(y)=h_2(z)$; there is no possibility of the two functions being swapped.
Each of $h_i(y)$ and $h_i(z)$ (for $i=1$ or $2$) has probability $1/n$ of equalling $h_i(x)$. So the calculation is much simpler, and the probability of a three-way collision is
$$\frac{1}{n^4}.$$
So triple collisions are much less likely (by nearly a factor of four) in the single-table solution as they are in the two-table solution.

Recall that the load factor of a hash table is defined as
$n/N$, where $n$ is the number of key-value pairs stored in the
table and $N$ is the number of cells in the hash array. If a
single reference to an object can be stored in a single machine
word, then a single key-value pair takes two machine words to
store, so the minimum amount of memory required for any
structure that lists all pairs would be $2n$, while an
open-addressing hash table would use $2N$ machine words. Thus,
the load factor also equals $2n/2N$, the ratio between the
minimum amount of memory needed and the amount of memory
actually used.

Now, consider a hash chaining data structure, where we store
the key-value pairs within any cell of the hash table as a
linked list (a collection of objects that each have three
references or pointers, for the key, value, and next object in
the list) and where each table cell stores a single reference or
pointer to the first object in its list. If this structure uses
$M$ memory cells, define the "effective load factor" of this
structure to be $2n/M$, where $2n$ is again the minimum storage
required for $n$ key-value pairs and $M$ is the storage actually
used.

Write a formula that translates the load factor $f$ of a hash
chaining data structure to the effective load factor of the same
structure. What is the value of $f$ for which the effective load
factor is $1/2$?

Solution: If there are $n$ keys, and the load factor is $f$, then there are $n/f$ cells in the array of the hash table (a single pointer each), and $n$ linked list objects (with three pointers each), for a total memory usage of $n(1/f+3)$ pointers. Therefore, the effective load factor is
$$\frac{2}{1/f+3}.$$
To get this effective load factor to be $1/2$, we need $f$ to be $1$.
What this means is that, in order to use the same amount of memory as an open-addressing hash table with load factor $1/2$, a hash chaining data structure would have to use the higher load factor $1$. This is relevant if we want to compare the performance of different hashing algorithms with the same memory usage as each other. It wouldn't be fair to compare open addressing and hash chaining at the same (actual) load factors, because that comparison would give much more memory to the chaining algorithm than to the open addressing algorithm.