Word Level Models

Penn Treebank

A common evaluation dataset for language modeling is the Penn Treebank,
as pre-processed by Mikolov et al. (2011).
The dataset consists of 929k training words, 73k validation words, and
82k test words. As part of the pre-processing, words were lower-cased, numbers
were replaced with N, newlines were replaced with <eos>,
and all other punctuation was removed. The vocabulary is limited to
the 10k most frequent words, with all remaining tokens replaced by an <unk> token.
Models are evaluated based on perplexity, the exponential of the average
negative per-word log-probability (lower is better).
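
As a concrete illustration, perplexity can be computed directly from the log-probabilities a model assigns to each token; the minimal sketch below assumes natural-log probabilities and is purely illustrative.

    import math

    def perplexity(log_probs):
        # log_probs: natural-log probability the model assigned to each token.
        # Perplexity is the exponential of the average negative log-probability.
        return math.exp(-sum(log_probs) / len(log_probs))

    # A model that assigns probability 1/10 to every token has perplexity 10.
    print(perplexity([math.log(0.1)] * 4))  # -> 10.0 (up to floating point)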

1B Words / Google Billion Word benchmark

The One-Billion Word benchmark is a large dataset derived from a news-commentary site.
The dataset consists of 829,250,940 tokens over a vocabulary of 793,471 words.
Importantly, sentences in this dataset are shuffled, and hence the context available to a model is limited to a single sentence.

Character Level Models

Hutter Prize

The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the
first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset.
Within these 100 million bytes are 205 unique tokens.
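
These figures are easy to verify directly; the sketch below assumes the enwik8 file has already been downloaded into the working directory (the filename is illustrative).

    # Count the total bytes and the unique byte values in enwik8.
    with open("enwik8", "rb") as f:
        data = f.read()

    print(len(data))       # 100,000,000 bytes
    print(len(set(data)))  # 205 unique byte values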

Penn Treebank

The vocabulary of the underlying words in the character-level dataset is limited to the same 10,000 words used in the word-level dataset. This vastly simplifies character-level language modeling, as character transitions are restricted to those occurring within that limited word-level vocabulary.
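
For illustration, the character vocabulary can be derived directly from the word-level text; the sketch below assumes Mikolov's pre-processed training split is available locally as ptb.train.txt (a hypothetical path).

    # Collect the distinct characters appearing in the word-level PTB text.
    with open("ptb.train.txt") as f:
        text = f.read()

    chars = sorted(set(text))
    # Only a few dozen distinct symbols appear, since numbers were replaced
    # with N and most punctuation was removed during pre-processing.
    print(len(chars))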