If you knew what keys you were hashing before you designed the
hash function, you could make a hash function that puts n things in n
buckets with no collisions. See perfect
hashing.

Otherwise, there are a gazillion more possible keys than buckets,
and the best any hash function can do is map an equal number of those
gazillion keys to each bucket. The number of collisions you get is
expected to follow the chi^2
distribution.

Here's how to compute that:

Look up: b = total number of buckets

Look up: k = total number of keys

Compute: p = k/b, the expected number of keys per bucket

Look up: b_i = the number of buckets that have exactly i keys

chi^2 = sum (over all i) (b_i*((i-p)^2)/p)

The result is expected to be close to b, that is,
within 3*sqrt(b) of b. Chi^2 measures are usually reported
in units of standard deviations. That is, if the formula above gives
b+c*sqrt(b), they report c, and c is expected to be between -3 and 3.
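The steps above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name chi2_of_buckets and its input format (a list holding the number of keys that landed in each bucket) are assumptions made for illustration, and it assumes at least one key was hashed.

```python
import math
from collections import Counter

def chi2_of_buckets(bucket_counts):
    """Return (chi2, c) for a list holding the number of keys in each bucket.

    Assumes the list is non-empty and at least one key was hashed.
    """
    b = len(bucket_counts)   # total number of buckets
    k = sum(bucket_counts)   # total number of keys
    p = k / b                # expected number of keys per bucket

    # occupancy[i] is b_i, the number of buckets holding exactly i keys
    occupancy = Counter(bucket_counts)
    chi2 = sum(b_i * (i - p) ** 2 / p for i, b_i in occupancy.items())

    # Report in the units described above: chi2 = b + c*sqrt(b)
    c = (chi2 - b) / math.sqrt(b)
    return chi2, c
```

To use it, hash every key, increment bucket_counts[hash mod b] for each one, and pass the resulting list in. A value of c outside -3 to 3 is the signal to look more closely at the hash.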

Another rule of thumb, somewhat simpler: if #keys = #buckets/x, then about
#buckets/(x+.5) buckets should get filled with at least one key. See corollary
buckets.
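To see how close this rule of thumb comes, it can be compared with the exact expectation for uniformly random placement, where a given bucket stays empty with probability (1 - 1/#buckets)^#keys. The sketch below is only an illustration, and both function names are made up for it.

```python
def expected_filled_exact(keys, buckets):
    """Expected number of non-empty buckets for uniformly random placement."""
    # A given bucket stays empty with probability (1 - 1/buckets)**keys.
    return buckets * (1 - (1 - 1 / buckets) ** keys)

def expected_filled_rule(keys, buckets):
    """Rule of thumb from the text: buckets/(x + 0.5), where x = buckets/keys."""
    x = buckets / keys
    return buckets / (x + 0.5)
```

With 1000 keys in 4000 buckets (x = 4) the two agree to within one percent. The gap widens as x shrinks toward 1.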

Another rule (this one exact): if you place #keys into #buckets randomly,
expect (#keys choose 2)/#buckets collisions. Proof by induction: it's true for
0 or 1 key (no collisions). If it's true for #keys, add another key. This
adds #keys new pairs. For every placement of the previous keys, for every new
pair, there is 1 out of #buckets placements that will cause a collision for
that key, so #keys/#buckets new collisions for that new key over all new pairs,
so a total of (#keys+1 choose 2)/#buckets collisions for #keys+1 keys.
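For small cases the formula can be checked by brute force. The sketch below enumerates every possible placement of the keys and averages the number of colliding pairs, using exact fractions. It counts a collision as a pair of keys sharing a bucket, which is the reading the proof uses. The function names are made up for this illustration, and math.comb needs Python 3.8 or later.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb

def average_colliding_pairs(keys, buckets):
    """Average, over every possible placement, of the key pairs sharing a bucket."""
    total = 0
    placements = 0
    # Each placement assigns one bucket index to each key.
    for placement in product(range(buckets), repeat=keys):
        placements += 1
        total += sum(1 for a, b in combinations(placement, 2) if a == b)
    return Fraction(total, placements)

def predicted_colliding_pairs(keys, buckets):
    """The formula from the text: (#keys choose 2)/#buckets."""
    return Fraction(comb(keys, 2), buckets)
```

The enumeration visits #buckets^#keys placements, so it is only practical for a handful of keys and buckets.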

Decent hash functions are good for all data, period. If your hash
does well on character data but badly on numeric data, that indicates
the hash function is much weaker than it ought to be. See my paper on funnels in hash functions.

If two decent hash functions are different, they are almost certainly
independent too. (There are
(2^32)^(2^128) ways to hash 16 bytes to 4
bytes.) However, if both hashes are flawed, it's not unlikely that
they share the same flaw.

Two things make a hash function decent. First, the function should permute its internal state. That
is, for every key, for every final internal state there should be
exactly one initial internal state that can produce it. Second, the
function should have no funnels.
That is, for every set of input bits, changing those input bits in the
key should cause at least that many output bits to change half the
time.

A hash which permutes its internal state has every piece of the key
affect the result equally. A hash with no funnels will cause outputs
to look more different than inputs.
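The first property can be tested exhaustively when the internal state is small. The sketch below uses an 8-bit state and two mixing steps invented for this illustration (toy_mix and lossy_mix are hypothetical, not taken from any real hash). For every key byte it checks that the step maps the 256 possible states onto 256 distinct states.

```python
def toy_mix(state, key_byte):
    """Hypothetical 8-bit mixing step; every operation is reversible."""
    state ^= key_byte            # XOR with the key is its own inverse
    state = (state * 5) & 0xFF   # an odd multiplier is invertible mod 256
    state ^= state >> 3          # an xor-shift is invertible
    return state

def lossy_mix(state, key_byte):
    """Hypothetical counterexample: the AND throws away state bits."""
    return (state & key_byte) ^ (state >> 1)

def permutes_state(mix, bits=8):
    """True if, for every key, mix maps the states one-to-one onto themselves."""
    size = 1 << bits
    return all(
        len({mix(state, key) for state in range(size)}) == size
        for key in range(size)
    )
```

Here permutes_state(toy_mix) is True and permutes_state(lossy_mix) is False. Real hashes have states far too large to enumerate, so the usual approach is to build the mix from steps that are each known to be reversible.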

If you are hashing 2^n keys and you want a chance of
collision of at most 1 in 2^m, then you need 2(n+m) bits in
your hash value. There have been about 2^80 machine cycles
executed in the history of the human race so far. Figure out how many
bits you need. You should use a cryptographic hash.
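One way to put numbers on this is to reuse the expected-collisions rule: the expected number of colliding pairs, (#keys choose 2)/#buckets, is an upper bound on the chance of getting any collision at all. The sketch below finds the smallest hash width that pushes that bound down to 1 in 2^m for 2^n keys. The function name min_hash_bits is an assumption made for this illustration, and it assumes an ideal, uniformly distributed hash.

```python
from fractions import Fraction
from math import comb

def min_hash_bits(n, m):
    """Smallest bit count whose expected colliding pairs for 2**n keys is <= 2**-m."""
    keys = 2 ** n
    pairs = comb(keys, 2)            # number of key pairs that could collide
    target = Fraction(1, 2 ** m)     # allowed chance of collision
    bits = 0
    # Buckets = 2**bits; widen the hash until the bound meets the target.
    while Fraction(pairs, 2 ** bits) > target:
        bits += 1
    return bits
```

For example, min_hash_bits(10, 20) returns 39, which is 2n+m-1. The 2(n+m) figure gives 60 for the same case, so it leaves a comfortable margin.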