When re-ranking a large number of generation hypotheses, the LM score seems to prefer (i.e. assign the lowest PPL to) hypotheses that contain rare words, which I assume are unknown to the LM. How does the scorer deal with these unknowns? Is it possible that hypotheses with unknown words are preferred over hypotheses without them, just because unknown words are more likely to appear in a sentence?

You are talking about reranking the n-best list, right? (Not using the LM during decoding.)

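For the LM, <unk> is just a token like any other. So indeed, if the vocabulary used for LM training was too small, <unk> becomes very frequent in the training data and ends up with a high probability, which makes hypotheses containing unknown words look good. Normally, LMs are trained with very large vocabularies.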
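
To make the effect concrete, here is a minimal sketch with a toy unigram LM. It is not the actual scorer: the corpus, the add-one smoothing, and the names `train_unigram` and `perplexity` are all assumptions made for illustration. It only shows how a small training vocabulary inflates the probability of <unk>, so that a hypothesis full of unseen words gets a lower PPL than one made of known words.

```python
import math
from collections import Counter

def train_unigram(corpus, vocab_size):
    """Toy unigram LM (hypothetical helper): keep the vocab_size most
    frequent words and map every other training token to <unk>."""
    counts = Counter(w for sent in corpus for w in sent)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    mapped = Counter(w if w in vocab else "<unk>" for sent in corpus for w in sent)
    total = sum(mapped.values())
    types = vocab | {"<unk>"}
    # Add-one smoothing (an assumption) so that no token has zero probability.
    return {w: (mapped[w] + 1) / (total + len(types)) for w in types}

def perplexity(model, sentence):
    """Per-token perplexity; out-of-vocabulary words are scored as <unk>."""
    logp = sum(math.log(model.get(w, model["<unk>"])) for w in sentence)
    return math.exp(-logp / len(sentence))

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "fish", "swam"],
]

small = train_unigram(corpus, vocab_size=2)    # only "the" and "sat" are kept
large = train_unigram(corpus, vocab_size=100)  # every training word is kept

hyp_known = ["the", "cat", "sat"]         # all words seen in training
hyp_rare = ["the", "zebra", "galloped"]   # two words never seen in training

# Small vocabulary: <unk> absorbs half of the training tokens, so the
# hypothesis with unseen words gets the lower perplexity.
print(perplexity(small, hyp_known), perplexity(small, hyp_rare))

# Large vocabulary: <unk> is rare, so the known-word hypothesis wins.
print(perplexity(large, hyp_known), perplexity(large, hyp_rare))
```

If the reranking shows the same pattern, checking how often <unk> occurs in the LM training data (after vocabulary mapping) is a quick way to confirm that the vocabulary is the cause.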