
Model 1: Generate the Whole Sequence

The first approach generates the entire textual description from the photograph in a single step.

Input: Photograph

Output: Complete textual description.

This is a one-to-many sequence prediction model that generates the entire output in a one-shot manner.

Model 1 – Generate the Whole Sequence

This model puts a heavy burden on the language model to generate the right words in the right order.

The photograph passes through a feature extraction model such as a model pre-trained on the ImageNet dataset.

The output sequence is one hot encoded, so that the model predicts a probability distribution over the entire vocabulary for each word in the sequence.

All sequences are padded to the same length. This means that the model is forced to generate multiple “no word” time steps in the output sequence.
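The target encoding described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy vocabulary, the `<pad>` name for the "no word" step, and the helper `encode_target` are all hypothetical and not taken from any particular library.

```python
# Hypothetical toy vocabulary; index 0 is reserved for the "no word" padding step.
vocab = ["<pad>", "a", "dog", "runs", "on", "grass"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def encode_target(description, max_len):
    """Integer encode, pad to max_len, then one hot encode each time step."""
    ids = [word_to_id[w] for w in description.split()]
    ids = ids[:max_len] + [0] * (max_len - len(ids))  # pad with "no word"
    return [[1 if j == i else 0 for j in range(len(vocab))] for i in ids]

target = encode_target("a dog runs", max_len=5)
for step in target:
    print(step)
```

With a three-word description and a maximum length of five, the last two rows are "no word" steps that the model must still learn to emit.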

Testing this method, I found that a very large language model is required, and even then it is hard to get past the model producing the NLP equivalent of persistence, e.g. the same word repeated for the entire length of the output sequence.

Model 2: Generate Word from Word

In this approach, an LSTM predicts one word given a photograph and one word as input.

Input 1: Photograph.

Input 2: Previously generated word, or start of sequence token.

Output: Next word in sequence.

This is a one-to-one sequence prediction model that generates the textual description via recursive calls to the model.

Model 2 – Generate Word From Word

The one-word input is either a start of sequence token, the first time the model is called, or the word generated by the previous call to the model.

The photograph passes through a feature extraction model such as a model pre-trained on the ImageNet dataset. The input word is integer encoded and passes through a word embedding.

The output word is one hot encoded to allow the model to predict the probabilities of words over the whole vocabulary.

The recursive word generation process is repeated until an end of sequence token is generated.
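The recursive generation loop can be sketched as follows. The function `toy_next_word` is a hypothetical stand-in for a trained model (it just follows a fixed lookup table), and the `max_words` cap is an added assumption to guard against the end of sequence token never being produced.

```python
START, END = "<start>", "<end>"

def toy_next_word(photo_features, previous_word):
    """Hypothetical stand-in for the trained model: returns the next word."""
    transitions = {START: "a", "a": "dog", "dog": "runs", "runs": END}
    return transitions.get(previous_word, END)

def generate_description(photo_features, next_word_fn, max_words=20):
    """Call the model repeatedly, feeding back the previously generated word."""
    words = []
    previous = START
    for _ in range(max_words):  # cap the length in case END is never produced
        word = next_word_fn(photo_features, previous)
        if word == END:
            break
        words.append(word)
        previous = word
    return " ".join(words)

photo_features = [0.1, 0.7, 0.2]  # placeholder for extracted photo features
print(generate_description(photo_features, toy_next_word))
```

Because each call sees only the previous word, nothing in the loop itself prevents the model from revisiting the same word and cycling.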

Testing this method, I found that the model does generate some good n-gram sequences, but gets caught in a loop repeating the same sequences of words for long descriptions. There is insufficient memory in the model to remember what has been generated previously.

Model 3: Generate Word from Sequence

Given a photograph and a sequence of words already generated for the photograph as input, predict the next word in the description.

This is a many-to-one sequence prediction model that generates a textual description via recursive calls to the model.

Model 3 – Generate Word From Sequence

This is a generalization of Model 2 above, where the input sequence of words gives the model context for generating the next word in the sequence.

The photograph passes through a feature extraction model such as a model pre-trained on the ImageNet dataset. The photograph may be provided at each time step with the sequence, or once at the beginning, which may be the preferred approach.

The input sequence is padded to a fixed length and integer encoded to pass through a word embedding.

The output word is one hot encoded to allow the model to predict the probabilities of words over the whole vocabulary.

The recursive word generation process is repeated until an end of sequence token is generated.
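The input and output encoding described above implies that one description expands into several training samples, one per word to be predicted. A minimal sketch, assuming hypothetical helpers `build_vocab` and `make_pairs`, `<start>` and `<end>` tokens, and zeros added on the left as padding (padding on the right is an equally valid choice):

```python
START, END = "<start>", "<end>"

def build_vocab(descriptions):
    """Map each word to an integer, reserving 0 for padding."""
    words = sorted({w for d in descriptions for w in d.split()} | {START, END})
    return {word: i + 1 for i, word in enumerate(words)}

def make_pairs(description, word_to_id, max_len):
    """Return (padded input sequence, next word id) pairs for one description."""
    ids = [word_to_id[w] for w in [START] + description.split() + [END]]
    pairs = []
    for i in range(1, len(ids)):
        prefix = ids[:i][-max_len:]                      # words generated so far
        padded = [0] * (max_len - len(prefix)) + prefix  # left-pad with zeros
        pairs.append((padded, ids[i]))                   # target is the next word
    return pairs

word_to_id = build_vocab(["a dog runs"])
pairs = make_pairs("a dog runs", word_to_id, max_len=4)
for padded, target in pairs:
    print(padded, "->", target)
```

Each target id would then be one hot encoded over the vocabulary, and each sample paired with the features of the same photograph.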

This appears to be the preferred model described in papers on the topic and might be the best structure we have for this type of problem for now.

Testing this method, I have found that the model readily generates readable descriptions, the quality of which often improves with larger models trained for longer. Key to the skill of this model is the masking of padded input sequences. Without masking, the generated sequences of words are terrible, e.g. the end of sequence token is repeated over and over.
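To give a feel for why masking matters, here is a toy illustration of the idea, not of how any deep learning library implements it. The recurrent update `toy_step` is hypothetical; the point is only that unmasked padding steps keep altering the state after the real words have been read. In Keras, this behavior is typically enabled with `mask_zero=True` on the `Embedding` layer.

```python
def padding_mask(sequence, pad_id=0):
    """True for real words, False for "no word" padding steps."""
    return [token != pad_id for token in sequence]

def run_recurrent(sequence, step_fn, state=0.0, mask=None):
    """Apply step_fn over the sequence, carrying the state past masked steps."""
    if mask is None:
        mask = [True] * len(sequence)
    for token, keep in zip(sequence, mask):
        if keep:
            state = step_fn(state, token)
    return state

def toy_step(state, token):
    """Hypothetical recurrent update: decay the old state, mix in the new token."""
    return 0.5 * state + token

padded = [3, 4, 5, 0, 0, 0]  # three real words followed by padding
unmasked = run_recurrent(padded, toy_step)
masked = run_recurrent(padded, toy_step, mask=padding_mask(padded))
print(unmasked, masked)
```

With masking, the final state is the same as if the padding had never been appended; without it, the three padding steps wash out most of what was learned from the real words.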

Modeling Best Practices

This section lists some general tips when developing caption generation models.

Pre-trained Photo Feature Extraction Model. Use a photo feature extraction model pre-trained on a large dataset like ImageNet. This is called transfer learning. The Oxford Visual Geometry Group (VGG) models, which placed first in the localization task and second in the classification task of the 2014 ImageNet competition, are a good start.

Pre-trained Word Embedding Model. Use a pre-trained word embedding model with vectors either trained on a large corpus or trained on your specific text data.

Fine Tune Pre-trained Models. Explore making the pre-trained models trainable in your model to see if they can be dialed-in for your specific problem and result in a slight lift in skill.

Pre-Processing Text. Pre-process textual descriptions to reduce the vocabulary of words to generate, and in turn, the size of the model.

Pre-Processing Photos. Pre-process photos for the photo feature extraction model, and even pre-extract features so the full feature extraction model is not required when training your model.

Padding Text. Pad input sequences to a fixed length; this is in fact a requirement of vectorizing your input for deep learning libraries.

Masking Padding. Use masking on the embedding layer to ignore “no word” time steps, often a zero value when words are integer encoded.

Attention. Use attention on the input sequence when generating the output word in order to both achieve better performance and understand where the model is “looking” when each word is being generated.
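As one concrete example of the text pre-processing tip above, the sketch below lowercases descriptions, strips punctuation, and keeps only words that occur often enough. The helper names, the sample descriptions, and the `min_count` threshold are illustrative assumptions, not a prescribed recipe.

```python
import string
from collections import Counter

def clean_description(text):
    """Lowercase, strip punctuation, and keep alphabetic tokens only."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w.isalpha()]

def build_vocabulary(descriptions, min_count=2):
    """Keep words that appear at least min_count times across all descriptions."""
    counts = Counter(w for d in descriptions for w in clean_description(d))
    return sorted(w for w, c in counts.items() if c >= min_count)

descriptions = [
    "A dog runs on the grass.",
    "The dog is running!",
    "A child plays on the grass",
]
vocabulary = build_vocabulary(descriptions)
print(vocabulary)
```

A smaller vocabulary shrinks both the embedding layer and the output layer, since the output is a probability distribution over every word.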

So you put all zeros in place of the visual features from the second time step onward? I mean, you would provide the image only at time step 1, together with a special START word. Then, from time step 2 onward, you provide all zeros instead of the visual features, along with the previously predicted word.