Estonian gap tests

The Estonian gap tests corpus is a collection of sentences in which one word is marked as a "gap" and is accompanied by a list of candidate words. The corpus can be used as a benchmark for evaluating language models. It covers both frequent and infrequent gap-words and includes candidate lists generated in different ways. The sentences originate from the Estonian Reference Corpus (http://www.cl.ut.ee/korpused/segakorpus/). The corpus has been tokenized using the Estnltk toolkit (https://github.com/estnltk/estnltk).

The archive contains sentence files with the extension ".gaps" and candidate files with the extension ".var". A sentence file contains one sentence per line. Each line starts with an integer that gives the gap-word's offset in the sentence. Offsets are zero-based, so the first word in the sentence is at position 0. Based on the frequency of the gap-word, we generated four kinds of sentence files:

Each sentence file has several related candidate files. In a candidate file, each line contains a list of 200 candidate words for the sentence on the same line of the related sentence file. Candidate files were generated using the same frequency ranges as the sentence files. We also provide four kinds of candidate files:

File suffix     Explanation
--------------------------------------------------------------------------------------
*.pos.var       candidates with the same part of speech as a gap-word
*.syn.var       candidates generated with a morphological generator based on the base form of a gap-word
*.w2v.var       candidate words from word2vec's most similar query
*.random.var    random words
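
The file layout described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the corpus distribution. It assumes that the offset, the sentence tokens, and the candidate words are all separated by whitespace and that the files are UTF-8 encoded; neither is stated in this description. The function names and the example sentence are made up for illustration and are not taken from the corpus.

```python
from typing import Iterator, List, Tuple


def parse_gap_line(line: str) -> Tuple[int, List[str]]:
    """Split one sentence-file line into (gap offset, tokens).

    Assumption: the leading integer and the tokens are whitespace-separated.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty sentence line")
    offset = int(fields[0])
    tokens = fields[1:]
    # Offsets are zero-based: 0 refers to the first word of the sentence.
    if not 0 <= offset < len(tokens):
        raise ValueError(f"offset {offset} is outside a sentence of {len(tokens)} tokens")
    return offset, tokens


def mask_gap(offset: int, tokens: List[str], placeholder: str = "___") -> str:
    """Return the sentence with the gap-word replaced by a placeholder."""
    return " ".join(tokens[:offset] + [placeholder] + tokens[offset + 1:])


def read_gap_tests(gaps_path: str, var_path: str) -> Iterator[Tuple[int, List[str], List[str]]]:
    """Pair each sentence with the candidates on the same line of a related file.

    Assumptions: UTF-8 encoding and whitespace-separated candidates.
    """
    with open(gaps_path, encoding="utf-8") as gaps, open(var_path, encoding="utf-8") as cands:
        for sentence_line, candidate_line in zip(gaps, cands):
            offset, tokens = parse_gap_line(sentence_line)
            yield offset, tokens, candidate_line.split()


# Invented example line, not taken from the corpus.
offset, tokens = parse_gap_line("2 see on ilus päev .")
print(tokens[offset])
print(mask_gap(offset, tokens))
```

With read_gap_tests, a language model can be scored by checking how it ranks the gap-word against the candidates on the matching line. Whether the gap-word itself appears in the candidate list is not specified here, so an evaluation script should not rely on it.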