DataFeeder converts the data returned by paddle.reader into the Arguments
data structure defined in the API. A paddle.reader usually returns a list of
mini-batch data entries. Each entry in the list is one sample, and each
sample is a list or a tuple holding one or more features. DataFeeder converts
these mini-batch entries into Arguments so that they can be fed to the C++
interface.
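
To make the input shape described above concrete, here is a minimal sketch in
plain Python. It does not call DataFeeder or any Paddle API; the helper name
make_minibatches and the sample values are hypothetical and only illustrate
what "a list of samples, each a tuple of features" looks like.

```python
from itertools import islice


def make_minibatches(samples, batch_size):
    """Group an iterable of samples into lists of at most batch_size samples.

    Hypothetical helper for illustration; not part of Paddle.
    """
    it = iter(samples)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Each sample is a tuple of features: (feature values, label).
samples = [
    ([0.1, 0.2, 0.3], 0),
    ([0.4, 0.5, 0.6], 1),
    ([0.7, 0.8, 0.9], 0),
]

# Every element of `batches` is one mini-batch: a list of samples.
batches = list(make_minibatches(samples, batch_size=2))
```

A mini-batch of this shape is the kind of input the paragraph above says
DataFeeder turns into Arguments.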

Indeed, a data reader does not have to be a function that reads and yields
data items. It can be any function with no parameters that creates an iterable
(anything that can be used in for x in iterable):

iterable = data_reader()

Each element produced by the iterable should be a single entry of data,
not a mini-batch. That entry can be a single item or a tuple of items.
Each item should be of a supported type (e.g., a numpy 1-D array of float32,
an int, or a list of int).
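
The following sketch shows two readers that satisfy the description above:
one returns a plain list, the other yields lazily. The function names and
data values are hypothetical, and the example uses only built-in types (list
of int, int) so that it runs without Paddle or numpy.

```python
def data_reader():
    """Hypothetical reader: takes no parameters and returns an iterable."""
    # A plain list is already iterable, so a generator is not required.
    return [
        ([1, 5, 2], 0),  # one entry: a tuple of (list of int, int)
        ([7, 3], 1),
    ]


def generator_reader():
    """The same entries, produced lazily with yield."""
    for words, label in data_reader():
        yield words, label


# Both readers are consumed the same way: call them, then iterate.
entries = [entry for entry in data_reader()]
lazy_entries = list(generator_reader())
```

Each entry here is a single sample, not a mini-batch; grouping entries into
mini-batches is a separate step.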

Conll05 dataset.
The Paddle semantic role labeling Book and demo use this dataset as an
example. Because the Conll05 dataset is not freely available, the default
download URL points to the Conll05 test set (which is public). Users can
change the URL and MD5 to point to their own Conll dataset. A pre-trained
word vector model based on the Wikipedia corpus is used to initialize the SRL
model.

paddle.v2.dataset.conll05.get_dict()

Get the word, verb and label dictionaries of the Wikipedia corpus.

paddle.v2.dataset.conll05.get_embedding()

Get the trained word vectors based on the Wikipedia corpus.

paddle.v2.dataset.conll05.test()

Conll05 test set creator.

Because the training dataset is not free, the test dataset is used for
training. It returns a reader creator. Each sample in the reader has nine
features, including the sentence sequence, predicate, predicate context,
predicate context flag and tagged sequence.

This module downloads the IMDB dataset from
http://ai.stanford.edu/%7Eamaas/data/sentiment/. The dataset contains
25,000 highly polar movie reviews for training and 25,000 for testing.
The module also provides an API for building the dictionary.

paddle.v2.dataset.imdb.build_dict(pattern, cutoff)

Build a word dictionary from the corpus. Keys of the dictionary are words,
and values are zero-based IDs of these words.
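
As a rough sketch of the idea (not the module's actual implementation), the
code below builds such a dictionary from tokenized documents that are already
in memory. The function name build_word_dict, its docs argument, and the
reading of cutoff as "keep words that occur more than cutoff times" are
assumptions made for illustration; the real build_dict takes a pattern that
selects files from the downloaded corpus.

```python
from collections import Counter


def build_word_dict(docs, cutoff):
    """Map each word seen more than `cutoff` times to a zero-based ID.

    Hypothetical helper; the cutoff semantics are an assumption.
    """
    freq = Counter(word for doc in docs for word in doc)
    kept = [(word, count) for word, count in freq.items() if count > cutoff]
    # Most frequent first; ties broken alphabetically for a stable order.
    kept.sort(key=lambda wc: (-wc[1], wc[0]))
    return {word: idx for idx, (word, _) in enumerate(kept)}


docs = [["a", "good", "movie"], ["a", "bad", "movie"], ["a", "movie"]]
word_idx = build_word_dict(docs, cutoff=1)
```

With this toy corpus, only "a" and "movie" occur more than once, so they are
the only keys and receive the IDs 0 and 1.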

paddle.v2.dataset.imdb.train(word_idx)

IMDB training set creator.

It returns a reader creator. Each sample in the reader is a zero-based ID
sequence and a label in [0, 1].

Parameters:

word_idx (dict) – word dictionary

Returns:

Training reader creator

Return type:

callable
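
The sketch below shows what a reader creator with this sample shape could
look like. It is a stand-alone illustration rather than the module's code:
make_reader, the in-memory docs and labels, and the use of a separate ID for
words missing from the dictionary are all assumptions.

```python
def make_reader(docs, labels, word_idx, unk_id):
    """Return a reader creator whose samples are (ID sequence, label).

    Hypothetical helper; not part of paddle.v2.dataset.imdb.
    """
    def reader():
        for doc, label in zip(docs, labels):
            # Words missing from word_idx map to unk_id (an assumption).
            yield [word_idx.get(word, unk_id) for word in doc], label
    return reader


word_idx = {"a": 0, "movie": 1}
unk_id = len(word_idx)  # assumed ID for out-of-dictionary words

train = make_reader(
    [["a", "good", "movie"], ["a", "bad", "movie"]],
    [0, 1],
    word_idx,
    unk_id,
)
samples = list(train())
```

Calling train() creates a fresh iterator each time, which is what makes the
returned callable a reader creator rather than a one-shot iterator.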

paddle.v2.dataset.imdb.test(word_idx)

IMDB test set creator.

It returns a reader creator. Each sample in the reader is a zero-based ID
sequence and a label in [0, 1].

The Movielens 1-M dataset contains 1 million ratings from 6000 users on 4000
movies and was collected by GroupLens Research. This module downloads the
Movielens 1-M dataset from
http://files.grouplens.org/datasets/movielens/ml-1m.zip and parses the
training set and test set into paddle reader creators.