A dataset is an object that accepts sequences of raw data (sentence pairs
in the case of machine translation) and fields which describe how this
raw data should be processed to produce tensors. When a dataset is
instantiated, it applies the fields’ preprocessing pipeline to the raw
data (but not the steps that numericalize the data or turn it into
batch tensors), producing a list of torchtext.data.Example objects.
torchtext’s iterators then know how to use these examples to make batches.
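
The split between preprocessing at construction time and numericalization at batch time can be illustrated with a minimal, library-free sketch. It does not use torchtext; preprocess, build_examples, numericalize and the stoi mapping are hypothetical stand-ins chosen for illustration, and lowercasing plus whitespace splitting is an assumed preprocessing step.

```python
from types import SimpleNamespace


def preprocess(raw):
    """Stand-in for a field's preprocessing: lowercase, split on whitespace."""
    return raw.lower().split()


def build_examples(raw_pairs):
    """Apply preprocessing only; tokens stay strings (no ids, no tensors)."""
    return [
        SimpleNamespace(src=preprocess(src), tgt=preprocess(tgt))
        for src, tgt in raw_pairs
    ]


def numericalize(tokens, stoi, unk_index=0):
    """Deferred step: map tokens to integer ids when a batch is built."""
    return [stoi.get(tok, unk_index) for tok in tokens]


raw_pairs = [("Hello world", "Bonjour le monde"), ("Good morning", "Bonjour")]
examples = build_examples(raw_pairs)

# Hypothetical source-side vocabulary, built after the examples exist.
stoi = {"<unk>": 0, "hello": 1, "world": 2, "good": 3, "morning": 4}
src_ids = [numericalize(ex.src, stoi) for ex in examples]
```

In this sketch the example objects hold only token lists; integer ids appear later, which mirrors the division of labour described above between the dataset and the iterators.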

Parameters

fields (dict[str, Field]) – a dict with the structure
returned by onmt.inputters.get_fields(). Keys are usually
the dataset sides, "src" and "tgt", and match the keys of the
items yielded by the readers; values are lists of
(name, Field) pairs. For each pair, an attribute called name
is created on every torchtext.data.Example object, and its
value is the result of applying the Field to the data that
matches the key. Having a sequence of fields for each piece of
raw input lets the dataset store multiple “views” of each input,
which makes it easy to implement token-level features, mixed word-
and character-level models, and so on. (See also
onmt.inputters.TextMultiField.)
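
The idea of several “views” per raw input can be sketched without torchtext. Here each field is modelled as a plain callable, and word_view, char_view, make_example and the "src_char" attribute name are all hypothetical; real Field objects carry more state than this.

```python
from types import SimpleNamespace


def word_view(text):
    """Word-level view: whitespace tokens."""
    return text.split()


def char_view(text):
    """Character-level view: one list of characters per token."""
    return [list(tok) for tok in text.split()]


# Each key maps to a list of (name, field) pairs, as described above.
fields = {
    "src": [("src", word_view), ("src_char", char_view)],
    "tgt": [("tgt", word_view)],
}


def make_example(raw, fields):
    """Create one attribute per (name, field) pair from the data under key."""
    ex = SimpleNamespace()
    for key, named_fields in fields.items():
        for name, field in named_fields:
            setattr(ex, name, field(raw[key]))
    return ex


raw = {"src": "a cat", "tgt": "un chat"}
example = make_example(raw, fields)
```

The same raw "src" string ends up stored twice on the example, once per view, which is what makes mixed word- and character-level inputs straightforward.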

data (Iterable[Tuple[str, Any]]) – (name, data_arg) pairs
where data_arg is passed to the read() method of the reader
at the same position in readers. (See the reader object for
details on the Any type.)

dirs (Iterable[str or NoneType]) – A list of directories where
data is contained. See the reader object for more details.
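
A rough sketch of how data and dirs might be paired position by position with a list of readers follows. ListReader, its read(sequences, side, _dir) signature, the "indices" key and join_dicts are assumptions made for illustration; consult the reader object for the actual interface.

```python
class ListReader:
    """Hypothetical reader: yields one dict per in-memory sequence."""

    def read(self, sequences, side, _dir=None):
        # _dir is unused here; a file-based reader might resolve paths with it.
        for i, seq in enumerate(sequences):
            yield {side: seq, "indices": i}


def join_dicts(*dicts):
    """Merge the per-side dicts that belong to the same example."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged


readers = [ListReader(), ListReader()]
data = [
    ("src", ["hello world", "good morning"]),
    ("tgt", ["bonjour le monde", "bonjour"]),
]
dirs = [None, None]

# The i-th reader receives the i-th data_arg, name and directory.
read_iters = [
    reader.read(data_arg, name, dir_)
    for reader, (name, data_arg), dir_ in zip(readers, data, dirs)
]
joined = [join_dicts(*dicts) for dicts in zip(*read_iters)]
```

Each element of joined is a single dict whose keys ("src", "tgt") are the ones that fields would then be matched against.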

sort_key (Callable[[torchtext.data.Example], Any]) – A function
that returns the value on which the data is sorted (e.g. length).

filter_pred (Callable[[torchtext.data.Example], bool]) – A function
that accepts Example objects and returns a boolean value
indicating whether to include that example in the dataset.
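
As an illustration, the two callables might look like the following. This is a sketch under assumptions: the examples are modelled with SimpleNamespace rather than torchtext.data.Example, and the MAX_LEN threshold and the choice of source length as the sort value are hypothetical.

```python
from types import SimpleNamespace

MAX_LEN = 3  # hypothetical length limit


def sort_key(ex):
    """Value to sort on: here, the number of source tokens."""
    return len(ex.src)


def filter_pred(ex):
    """Keep only non-empty examples with at most MAX_LEN source tokens."""
    return 0 < len(ex.src) <= MAX_LEN


examples = [
    SimpleNamespace(src=s.split()) for s in ["a b c d", "a b", "", "a"]
]

kept = [ex for ex in examples if filter_pred(ex)]
ordered = sorted(kept, key=sort_key)
lengths = [len(ex.src) for ex in ordered]
```

Sorting by length is a common choice because it groups examples of similar size, which tends to reduce padding when batches are formed.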

Variables

src_vocabs (List[torchtext.data.Vocab]) – Used with dynamic dict/copy
attention. There is one very short vocab for each src example,
containing just that example’s source words, so that, for instance,
the generator can predict copying them.
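
A per-example source vocabulary of this kind can be sketched in a few lines. The helper build_src_vocab, the special tokens, the first-occurrence ordering and the src_map / alignment names are assumptions for illustration, not necessarily how the library builds or orders these vocabs.

```python
def build_src_vocab(src_tokens, specials=("<unk>", "<pad>")):
    """Build a tiny vocab from one example's source tokens only."""
    itos = list(specials)
    for tok in src_tokens:
        if tok not in itos:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


src = "the cat sat on the mat".split()
itos, stoi = build_src_vocab(src)

# Index of every source position in the per-example vocab.
src_map = [stoi[tok] for tok in src]

# Target tokens that also occur in the source get a copyable index;
# all others fall back to <unk>.
tgt = "le chat sat".split()
alignment = [stoi.get(tok, stoi["<unk>"]) for tok in tgt]
```

Because the vocab covers a single example, it stays small, and a target token can be tied to the source positions it could be copied from.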