Preprocessing

First, download the images and captions from the COCO website. By default, we use train2014, val2014, and val2017 for training, validation, and testing, respectively. The data directory should have the following structure:

Once all the annotations and images are downloaded to, say, DATA_DIR, you can run the following command to map caption words into indices in a dictionary and extract image features from a pretrained VGG19 network:

python preprocess.py --data $DATA_DIR --dest-dir $DEST_DIR
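As a rough illustration of what mapping caption words into indices involves, here is a minimal sketch. It is not the code in preprocess.py: the special tokens, the whitespace tokenization, and the names build_vocab and encode are all assumptions made for the example.

```python
from collections import Counter

# Special tokens; their names and order are assumptions for illustration.
SPECIALS = ["<pad>", "<unk>", "<bos>", "<eos>"]


def build_vocab(captions, min_count=1):
    """Map each sufficiently frequent word to an integer index."""
    counts = Counter(
        word for caption in captions for word in caption.lower().split()
    )
    vocab = {token: index for index, token in enumerate(SPECIALS)}
    # Sort by descending frequency, then alphabetically, so that the
    # assigned indices are deterministic across runs.
    for word, count in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(caption, vocab):
    """Convert a caption into a list of indices wrapped in <bos>/<eos>."""
    unk = vocab["<unk>"]
    body = [vocab.get(word, unk) for word in caption.lower().split()]
    return [vocab["<bos>"]] + body + [vocab["<eos>"]]


captions = ["a dog runs", "a cat sleeps", "a dog sleeps"]
vocab = build_vocab(captions)
# "jumps" never appears in the captions, so it falls back to <unk>.
indices = encode("a dog jumps", vocab)
```

Raising min_count drops rare words from the dictionary, which keeps the vocabulary small at the cost of more <unk> tokens in the encoded captions.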

Note that the resulting directory DEST_DIR will be quite large; the features for the training and validation images alone take up 157GB and 77GB, respectively. Experiments with HDF5 show a significant slowdown due to concurrent access from multiple data workers (see this discussion and this note). Hence, the preprocessing script saves the CNN features of each image into a separate file.
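The one-file-per-image layout can be pictured with the following dependency-free sketch. The file naming, the use of pickle, and the helper names are assumptions for illustration; the actual script may well store arrays or tensors in a different format.

```python
import os
import pickle
import tempfile


def feature_path(dest_dir, image_id):
    """Return the (hypothetical) path holding one image's features."""
    return os.path.join(dest_dir, f"{image_id}.pkl")


def save_features(dest_dir, image_id, features):
    """Write the features of a single image into its own file."""
    os.makedirs(dest_dir, exist_ok=True)
    with open(feature_path(dest_dir, image_id), "wb") as handle:
        pickle.dump(features, handle)


def load_features(dest_dir, image_id):
    """Read back the features of a single image.

    Each call opens only the file it needs, so parallel data workers
    never share a file handle.
    """
    with open(feature_path(dest_dir, image_id), "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as dest_dir:
    # Tiny nested lists stand in for real CNN feature maps.
    save_features(dest_dir, 42, [[0.1, 0.2], [0.3, 0.4]])
    save_features(dest_dir, 43, [[0.5, 0.6], [0.7, 0.8]])
    files = sorted(os.listdir(dest_dir))
    loaded = load_features(dest_dir, 43)
```

The trade-off is a very large number of small files in exchange for independent reads, which is what avoids the contention seen with a single shared HDF5 file.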

Training

To get started with training a model on COCO, you might find the following commands helpful:

The show-attend-tell model reaches a validation loss of 2.761 after the first epoch. The loss decreases to 2.298 after 20 epochs and does not drop below 2.266 after 50 epochs. Although the implementation doesn't support fine-tuning the CNN, the feature can be added quite easily and would probably yield better performance.

Prediction

Once training is done, you can generate predictions on the test dataset and compute BLEU scores: