Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph.

It requires both methods from computer vision to understand the content of the image and a language model from the field of natural language processing to turn the understanding of the image into words in the right order. Recently, deep learning methods have achieved state-of-the-art results on examples of this problem.

What is most impressive about these methods is that a single end-to-end model can be defined to predict a caption, given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.

In this tutorial, you will discover how to develop a photo captioning deep learning model from scratch.

After completing this tutorial, you will know:

How to prepare photo and text data for training a deep learning model.

How to design and train a deep learning caption generation model.

How to evaluate a trained caption generation model and use it to caption entirely new photographs.

Let’s get started.

Update Nov/2017: Added note about a bug introduced in Keras 2.1.0 and 2.1.1 that impacts the code in this tutorial.

Update Dec/2017: Updated a typo in the function name when explaining how to save descriptions to file, thanks Minel.

How to Develop a Deep Learning Caption Generation Model in Python from Scratch. Photo by Living in Monrovia, some rights reserved.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

Photo and Caption Dataset

Prepare Photo Data

Prepare Text Data

Develop Deep Learning Model

Evaluate Model

Generate New Captions

Python Environment

This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

We introduce a new benchmark collection for sentence-based image description and search, consisting of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events.

…

The images were chosen from six different Flickr groups, and tend not to contain any well-known people or locations, but were manually selected to depict a variety of scenes and situations.

The dataset is available for free. You must complete a request form and the links to the dataset will be emailed to you. I would love to link to them for you, but the email address expressly requests: “Please do not redistribute the dataset“.

One measure that can be used to evaluate the skill of the model is the BLEU score. For reference, below are some ballpark BLEU scores for skillful models when evaluated on the test dataset (taken from the 2017 paper “Where to put the Image in an Image Caption Generator“):

BLEU-1: 0.401 to 0.578.

BLEU-2: 0.176 to 0.390.

BLEU-3: 0.099 to 0.260.

BLEU-4: 0.059 to 0.170.

We will describe the BLEU metric in more detail later when we work on evaluating our model.
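To build some intuition for what BLEU-1 measures, its core is clipped unigram precision. The sketch below is a simplified pure-Python illustration only; the full metric (e.g. as computed by NLTK's corpus_bleu) also applies a brevity penalty and combines higher n-gram orders:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
	# count how often each word appears in the candidate caption
	cand_counts = Counter(candidate)
	# a candidate word only counts up to the maximum number of times
	# it appears in any single reference caption (the "clipping")
	max_ref_counts = Counter()
	for ref in references:
		for word, count in Counter(ref).items():
			max_ref_counts[word] = max(max_ref_counts[word], count)
	clipped = sum(min(count, max_ref_counts[word]) for word, count in cand_counts.items())
	return clipped / max(len(candidate), 1)

candidate = 'dog runs in field'.split()
references = ['dog runs in the field'.split(), 'a dog is running in a field'.split()]
print(clipped_unigram_precision(candidate, references))  # 1.0 (all 4 candidate words matched)
```

Note how precision alone would reward very short captions; that is why the real BLEU score adds a brevity penalty.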

Next, let’s look at how to load the images.

Prepare Photo Data

We will use a pre-trained model to interpret the content of the photos.

There are many models to choose from. In this case, we will use the Oxford Visual Geometry Group, or VGG, model that won the ImageNet competition in 2014.

Keras provides this pre-trained model directly. Note, the first time you use this model, Keras will download the model weights from the Internet, which are about 500 Megabytes. This may take a few minutes depending on your internet connection.

We could use this model as part of a broader image caption model. The problem is, it is a large model and running each photo through the network every time we want to test a new language model configuration (downstream) is redundant.

Instead, we can pre-compute the “photo features” using the pre-trained model and save them to file. We can then load these features later and feed them into our model as the interpretation of a given photo in the dataset. It is no different from running the photo through the full VGG model; we will just have done it once in advance.

This is an optimization that will make training our models faster and consume less memory.

We can load the VGG model in Keras using the VGG16 class. We will remove the last layer from the loaded model, as this is the layer used to predict a classification for a photo. We are not interested in classifying images, but we are interested in the internal representation of the photo right before a classification is made. These are the “features” that the model has extracted from the photo.

Below is a function named extract_features() that, given a directory name, will load each photo, prepare it for VGG, and collect the predicted features from the VGG model. The image features are a 1-dimensional 4,096 element vector.

The function returns a dictionary of image identifier to image features.


# extract features from each photo in the directory
def extract_features(directory):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# summarize
	print(model.summary())
	# extract features from each photo
	features = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get features
		feature = model.predict(image, verbose=0)
		# get image id
		image_id = name.split('.')[0]
		# store feature
		features[image_id] = feature
		print('>%s' % name)
	return features

We can call this function to prepare the photo data for testing our models, then save the resulting dictionary to a file named ‘features.pkl‘.

The complete example is listed below.


from os import listdir
from pickle import dump
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model

# extract features from each photo in the directory
def extract_features(directory):
	# load the model
	model = VGG16()
	# re-structure the model
	model.layers.pop()
	model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
	# summarize
	print(model.summary())
	# extract features from each photo
	features = dict()
	for name in listdir(directory):
		# load an image from file
		filename = directory + '/' + name
		image = load_img(filename, target_size=(224, 224))
		# convert the image pixels to a numpy array
		image = img_to_array(image)
		# reshape data for the model
		image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
		# prepare the image for the VGG model
		image = preprocess_input(image)
		# get features
		feature = model.predict(image, verbose=0)
		# get image id
		image_id = name.split('.')[0]
		# store feature
		features[image_id] = feature
		print('>%s' % name)
	return features

# extract features from all images
directory = 'Flicker8k_Dataset'
features = extract_features(directory)
print('Extracted Features: %d' % len(features))
# save to file
dump(features, open('features.pkl', 'wb'))

Running this data preparation step may take a while depending on your hardware, perhaps one hour on the CPU with a modern workstation.

At the end of the run, you will have the extracted features stored in ‘features.pkl‘ for later use. This file will be about 127 Megabytes in size.
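Since ‘features.pkl‘ is an ordinary pickled dictionary, it can be reloaded at any time without re-running the VGG model. Below is a small round-trip sketch with dummy data; the real file maps image identifiers to 4,096-element feature vectors:

```python
from pickle import dump, load

# dummy stand-in for the real extracted features dictionary
features = {'img_001': [0.1, 0.4, 0.9], 'img_002': [0.0, 0.2, 0.7]}

# save to file, exactly as done for features.pkl
dump(features, open('features_demo.pkl', 'wb'))

# later: reload the features without touching the VGG model
restored = load(open('features_demo.pkl', 'rb'))
print(restored['img_001'])  # [0.1, 0.4, 0.9]
```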

Prepare Text Data

The dataset contains multiple descriptions for each photograph and the text of the descriptions requires some minimal cleaning.

First, we will load the file containing all of the descriptions.


# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)

Each photo has a unique identifier. This identifier is used on the photo filename and in the text file of descriptions.

Next, we will step through the list of photo descriptions. Below defines a function load_descriptions() that, given the loaded document text, will return a dictionary of photo identifiers to descriptions. Each photo identifier maps to a list of one or more textual descriptions.


# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# create the list if needed
		if image_id not in mapping:
			mapping[image_id] = list()
		# store description
		mapping[image_id].append(image_desc)
	return mapping

# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))

Next, we need to clean the description text. The descriptions are already tokenized and easy to work with.

We will clean the text in the following ways in order to reduce the size of the vocabulary of words we will need to work with:

Convert all words to lowercase.

Remove all punctuation.

Remove all words that are one character or less in length (e.g. ‘a’).

Remove all words with numbers in them.

Below defines the clean_descriptions() function that, given the dictionary of image identifiers to descriptions, steps through each description and cleans the text.


import string

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc_list in descriptions.items():
		for i in range(len(desc_list)):
			desc = desc_list[i]
			# tokenize
			desc = desc.split()
			# convert to lower case
			desc = [word.lower() for word in desc]
			# remove punctuation from each token
			desc = [w.translate(table) for w in desc]
			# remove hanging 's' and 'a'
			desc = [word for word in desc if len(word) > 1]
			# remove tokens with numbers in them
			desc = [word for word in desc if word.isalpha()]
			# store as string
			desc_list[i] = ' '.join(desc)

# clean descriptions
clean_descriptions(descriptions)

Once cleaned, we can summarize the size of the vocabulary.

Ideally, we want a vocabulary that is both expressive and as small as possible. A smaller vocabulary will result in a smaller model that will train faster.

For reference, we can transform the clean descriptions into a set and print its size to get an idea of the size of our dataset vocabulary.


# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a set of all description words
	all_desc = set()
	for key in descriptions.keys():
		[all_desc.update(d.split()) for d in descriptions[key]]
	return all_desc

# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))

Finally, we can save the dictionary of image identifiers and descriptions to a new file named descriptions.txt, with one image identifier and description per line.

Below defines the save_descriptions() function that, given a dictionary containing the mapping of identifiers to descriptions and a filename, saves the mapping to file.


# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
	lines = list()
	for key, desc_list in descriptions.items():
		for desc in desc_list:
			lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

# save descriptions
save_descriptions(descriptions, 'descriptions.txt')

Putting this all together, the complete listing is provided below.


import string

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# extract descriptions for images
def load_descriptions(doc):
	mapping = dict()
	# process lines
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		if len(line) < 2:
			continue
		# take the first token as the image id, the rest as the description
		image_id, image_desc = tokens[0], tokens[1:]
		# remove filename from image id
		image_id = image_id.split('.')[0]
		# convert description tokens back to string
		image_desc = ' '.join(image_desc)
		# create the list if needed
		if image_id not in mapping:
			mapping[image_id] = list()
		# store description
		mapping[image_id].append(image_desc)
	return mapping

def clean_descriptions(descriptions):
	# prepare translation table for removing punctuation
	table = str.maketrans('', '', string.punctuation)
	for key, desc_list in descriptions.items():
		for i in range(len(desc_list)):
			desc = desc_list[i]
			# tokenize
			desc = desc.split()
			# convert to lower case
			desc = [word.lower() for word in desc]
			# remove punctuation from each token
			desc = [w.translate(table) for w in desc]
			# remove hanging 's' and 'a'
			desc = [word for word in desc if len(word) > 1]
			# remove tokens with numbers in them
			desc = [word for word in desc if word.isalpha()]
			# store as string
			desc_list[i] = ' '.join(desc)

# convert the loaded descriptions into a vocabulary of words
def to_vocabulary(descriptions):
	# build a set of all description words
	all_desc = set()
	for key in descriptions.keys():
		[all_desc.update(d.split()) for d in descriptions[key]]
	return all_desc

# save descriptions to file, one per line
def save_descriptions(descriptions, filename):
	lines = list()
	for key, desc_list in descriptions.items():
		for desc in desc_list:
			lines.append(key + ' ' + desc)
	data = '\n'.join(lines)
	file = open(filename, 'w')
	file.write(data)
	file.close()

filename = 'Flickr8k_text/Flickr8k.token.txt'
# load descriptions
doc = load_doc(filename)
# parse descriptions
descriptions = load_descriptions(doc)
print('Loaded: %d ' % len(descriptions))
# clean descriptions
clean_descriptions(descriptions)
# summarize vocabulary
vocabulary = to_vocabulary(descriptions)
print('Vocabulary Size: %d' % len(vocabulary))
# save to file
save_descriptions(descriptions, 'descriptions.txt')

Running the example first prints the number of loaded photo descriptions (8,092) and the size of the clean vocabulary (8,763 words).


Loaded: 8092
Vocabulary Size: 8763

Finally, the clean descriptions are written to ‘descriptions.txt‘.

Taking a look at the file, we can see that the descriptions are ready for modeling. The order of descriptions in your file may vary.


2252123185_487f21e336 bunch on people are seated in stadium

2252123185_487f21e336 crowded stadium is full of people watching an event

2252123185_487f21e336 crowd of people fill up packed stadium

2252123185_487f21e336 crowd sitting in an indoor stadium

2252123185_487f21e336 stadium full of people watch game

...

Develop Deep Learning Model

In this section, we will define the deep learning model and fit it on the training dataset.

This section is divided into the following parts:

Loading Data.

Defining the Model.

Fitting the Model.

Complete Example.

Loading Data

First, we must load the prepared photo and text data so that we can use it to fit the model.

We are going to train the model on all of the photos and captions in the training dataset. While training, we are going to monitor the performance of the model on the development dataset and use that performance to decide when to save models to file.

The train and development datasets have been predefined in the Flickr_8k.trainImages.txt and Flickr_8k.devImages.txt files respectively, both of which contain lists of photo file names. From these file names, we can extract the photo identifiers and use these identifiers to filter photos and descriptions for each set.

The function load_set() below will load a pre-defined set of identifiers given the train or development set's filename.


# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

Now, we can load the photos and descriptions using the pre-defined set of train or development identifiers.

Below is the function load_clean_descriptions() that loads the cleaned text descriptions from ‘descriptions.txt‘ for a given set of identifiers and returns a dictionary of identifiers to lists of text descriptions.

The model we will develop will generate a caption given a photo, and the caption will be generated one word at a time. The sequence of previously generated words will be provided as input. Therefore, we will need a ‘first word‘ to kick off the generation process and a ‘last word‘ to signal the end of the caption.

We will use the strings ‘startseq‘ and ‘endseq‘ for this purpose. These tokens are added to the loaded descriptions as they are loaded. It is important to do this now before we encode the text so that the tokens are also encoded correctly.


# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

Next, we can load the photo features for a given dataset.

Below defines a function named load_photo_features() that loads the entire set of photo features, then returns the subset of interest for a given set of photo identifiers.

This is not very efficient; nevertheless, this will get us up and running quickly.


# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

We can pause here and test everything developed so far.

The complete code example is listed below.


from pickle import load

# load doc into memory
def load_doc(filename):
	# open the file as read only
	file = open(filename, 'r')
	# read all text
	text = file.read()
	# close the file
	file.close()
	return text

# load a pre-defined list of photo identifiers
def load_set(filename):
	doc = load_doc(filename)
	dataset = list()
	# process line by line
	for line in doc.split('\n'):
		# skip empty lines
		if len(line) < 1:
			continue
		# get the image identifier
		identifier = line.split('.')[0]
		dataset.append(identifier)
	return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
	# load document
	doc = load_doc(filename)
	descriptions = dict()
	for line in doc.split('\n'):
		# split line by white space
		tokens = line.split()
		# split id from description
		image_id, image_desc = tokens[0], tokens[1:]
		# skip images not in the set
		if image_id in dataset:
			# create list
			if image_id not in descriptions:
				descriptions[image_id] = list()
			# wrap description in tokens
			desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
			# store
			descriptions[image_id].append(desc)
	return descriptions

# load photo features
def load_photo_features(filename, dataset):
	# load all features
	all_features = load(open(filename, 'rb'))
	# filter features
	features = {k: all_features[k] for k in dataset}
	return features

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# photo features
train_features = load_photo_features('features.pkl', train)
print('Photos: train=%d' % len(train_features))

Running this example first loads the 6,000 photo identifiers for the training dataset. These identifiers are then used to filter and load the cleaned description text and the pre-computed photo features.

We are nearly there.


Dataset: 6000
Descriptions: train=6000
Photos: train=6000

The description text will need to be encoded to numbers before it can be presented to the model as input or compared to the model’s predictions.

The first step in encoding the data is to create a consistent mapping from words to unique integer values. Keras provides the Tokenizer class that can learn this mapping from the loaded description data.

Below defines the to_lines() function to convert the dictionary of descriptions into a list of strings, and the create_tokenizer() function that will fit a Tokenizer given the loaded photo description text.


# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
	all_desc = list()
	for key in descriptions.keys():
		[all_desc.append(d) for d in descriptions[key]]
	return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
	lines = to_lines(descriptions)
	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(lines)
	return tokenizer

# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

We can now encode the text.

Each description will be split into words. The model will be provided one word and the photo, and it will generate the next word. Then the first two words of the description will be provided to the model, with the image, to generate the next word. This is how the model will be trained.

For example, the input sequence “little girl running in field” would be split into 6 input-output pairs to train the model:


X1, X2 (text sequence), y (word)

photo startseq, little

photo startseq, little, girl

photo startseq, little, girl, running

photo startseq, little, girl, running, in

photo startseq, little, girl, running, in, field

photo startseq, little, girl, running, in, field, endseq

Later, when the model is used to generate descriptions, the generated words will be concatenated and recursively provided as input to generate a caption for an image.
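The prefix expansion shown above can be sketched in plain Python. This simplified version works directly on word tokens, whereas the real create_sequences() function operates on the integer-encoded, padded form:

```python
def split_into_pairs(tokens):
	# each prefix of the caption becomes an input; the next word is the output
	pairs = []
	for i in range(1, len(tokens)):
		pairs.append((tokens[:i], tokens[i]))
	return pairs

tokens = 'startseq little girl running in field endseq'.split()
for in_seq, out_word in split_into_pairs(tokens):
	print(' '.join(in_seq), '->', out_word)
```

A caption of n tokens therefore yields n-1 training pairs, which is why even 6,000 photos with five captions each produce a large number of samples.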

The function below named create_sequences(), given the tokenizer, a maximum sequence length, and the dictionary of all descriptions and photos, will transform the data into input-output pairs of data for training the model. There are two input arrays to the model: one for photo features and one for the encoded text. There is one output for the model which is the encoded next word in the text sequence.

The input text is encoded as integers, which will be fed to a word embedding layer. The photo features will be fed directly to another part of the model. The model will output a prediction, which will be a probability distribution over all words in the vocabulary.

The output data will therefore be a one-hot encoded version of each word, representing an idealized probability distribution with 0 values at all word positions except the actual word position, which has a value of 1.
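That idealized distribution is simply a one-hot vector. A minimal sketch without Keras (to_categorical produces the same thing, assuming vocab_size classes):

```python
def one_hot(index, vocab_size):
	# zeros everywhere except a 1.0 at the position of the actual next word
	vec = [0.0] * vocab_size
	vec[index] = 1.0
	return vec

print(one_hot(2, 5))  # [0.0, 0.0, 1.0, 0.0, 0.0]
```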


# create sequences of images, input sequences and output words for an image
def create_sequences(tokenizer, max_length, descriptions, photos):
	X1, X2, y = list(), list(), list()
	# walk through each image identifier
	for key, desc_list in descriptions.items():
		# walk through each description for the image
		for desc in desc_list:
			# encode the sequence
			seq = tokenizer.texts_to_sequences([desc])[0]
			# split one sequence into multiple X,y pairs
			for i in range(1, len(seq)):
				# split into input and output pair
				in_seq, out_seq = seq[:i], seq[i]
				# pad input sequence
				in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
				# encode output sequence
				out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
				# store
				X1.append(photos[key][0])
				X2.append(in_seq)
				y.append(out_seq)
	return array(X1), array(X2), array(y)

We will need to calculate the maximum number of words in the longest description. A short helper function named max_length() is defined below.


# calculate the length of the description with the most words
def max_length(descriptions):
	lines = to_lines(descriptions)
	return max(len(d.split()) for d in lines)

We now have enough to load the data for the training and development datasets and transform the loaded data into input-output pairs for fitting a deep learning model.

Defining the Model

We will define a deep learning model based on the “merge-model” described by Marc Tanti, et al. in their 2017 papers. We will describe the model in three parts:

Photo Feature Extractor. This is a 16-layer VGG model pre-trained on the ImageNet dataset. We have pre-processed the photos with the VGG model (without the output layer) and will use the extracted features predicted by this model as input.

Sequence Processor. This is a word embedding layer for handling the text input, followed by a Long Short-Term Memory (LSTM) recurrent neural network layer.

Decoder (for lack of a better name). Both the feature extractor and sequence processor output a fixed-length vector. These are merged together and processed by a Dense layer to make a final prediction.

The Photo Feature Extractor model expects input photo features to be a vector of 4,096 elements. These are processed by a Dense layer to produce a 256 element representation of the photo.

The Sequence Processor model expects input sequences with a pre-defined length (34 words) which are fed into an Embedding layer that uses a mask to ignore padded values. This is followed by an LSTM layer with 256 memory units.

Both the input models produce a 256 element vector. Further, both input models use regularization in the form of 50% dropout. This is to reduce overfitting the training dataset, as this model configuration learns very fast.

The Decoder model merges the vectors from both input models using an addition operation. This is then fed to a Dense 256 neuron layer and then to a final output Dense layer that makes a softmax prediction over the entire output vocabulary for the next word in the sequence.

The function below named define_model() defines and returns the model ready to be fit.


# define the captioning model
def define_model(vocab_size, max_length):
	# feature extractor model
	inputs1 = Input(shape=(4096,))
	fe1 = Dropout(0.5)(inputs1)
	fe2 = Dense(256, activation='relu')(fe1)
	# sequence model
	inputs2 = Input(shape=(max_length,))
	se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
	se2 = Dropout(0.5)(se1)
	se3 = LSTM(256)(se2)
	# decoder model
	decoder1 = add([fe2, se3])
	decoder2 = Dense(256, activation='relu')(decoder1)
	outputs = Dense(vocab_size, activation='softmax')(decoder2)
	# tie it together [image, seq] [word]
	model = Model(inputs=[inputs1, inputs2], outputs=outputs)
	model.compile(loss='categorical_crossentropy', optimizer='adam')
	# summarize model
	print(model.summary())
	plot_model(model, to_file='model.png', show_shapes=True)
	return model

To get a sense for the structure of the model, specifically the shapes of the layers, see the summary listed below.

We also create a plot to visualize the structure of the network that better helps understand the two streams of input.

Plot of the Caption Generation Deep Learning Model

Fitting the Model

Now that we know how to define the model, we can fit it on the training dataset.

The model learns fast and quickly overfits the training dataset. For this reason, we will monitor the skill of the trained model on the holdout development dataset. When the skill of the model on the development dataset improves at the end of an epoch, we will save the whole model to file.

At the end of the run, we can then use the saved model with the best skill on the development dataset as our final model.

We can do this by defining a ModelCheckpoint in Keras and specifying it to monitor the minimum loss on the validation dataset and save the model to a file that has both the training and validation loss in the filename.
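A sketch of that checkpoint setup is below; the filename pattern and the fit() arguments shown are illustrative assumptions, not the only valid configuration:

```python
from keras.callbacks import ModelCheckpoint

# save the model only when validation loss improves,
# embedding both losses in the filename
filepath = 'model-ep{epoch:03d}-loss{loss:.3f}-val_loss{val_loss:.3f}.h5'
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1,
	save_best_only=True, mode='min')

# then fit with the development data as validation, e.g.:
# model.fit([X1train, X2train], ytrain, epochs=20, verbose=2,
#	callbacks=[checkpoint], validation_data=([X1test, X2test], ytest))
```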

Evaluate Model

Once the model is fit, we can evaluate the skill of its predictions on the holdout test dataset.

We will evaluate the model by generating descriptions for all photos in the test dataset and scoring those predictions with a standard metric.

First, we need to be able to generate a description for a photo using a trained model.

This involves passing in the start token ‘startseq‘, generating one word, then calling the model recursively with the generated words as input until the end-of-sequence token ‘endseq‘ is generated or the maximum description length is reached.

The function below named generate_desc() implements this behavior and generates a textual description given a trained model and a prepared photo as input. It calls the function word_for_id() in order to map an integer prediction back to a word.


# map an integer to a word
def word_for_id(integer, tokenizer):
	for word, index in tokenizer.word_index.items():
		if index == integer:
			return word
	return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
	# seed the generation process
	in_text = 'startseq'
	# iterate over the whole length of the sequence
	for i in range(max_length):
		# integer encode input sequence
		sequence = tokenizer.texts_to_sequences([in_text])[0]
		# pad input
		sequence = pad_sequences([sequence], maxlen=max_length)
		# predict next word
		yhat = model.predict([photo, sequence], verbose=0)
		# convert probability to integer
		yhat = argmax(yhat)
		# map integer to word
		word = word_for_id(yhat, tokenizer)
		# stop if we cannot map the word
		if word is None:
			break
		# append as input for generating the next word
		in_text += ' ' + word
		# stop if we predict the end of the sequence
		if word == 'endseq':
			break
	return in_text
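Note that word_for_id() scans the whole word index for every generated word. A common alternative is to invert the mapping once up front; below is a small sketch, using a hypothetical stand-in for tokenizer.word_index:

```python
# stand-in for a fitted tokenizer's word_index (word -> integer)
word_index = {'startseq': 1, 'little': 2, 'girl': 3, 'endseq': 4}

# invert the mapping once, then each lookup is a dictionary access
index_word = {index: word for word, index in word_index.items()}

def word_for_id_fast(integer):
	return index_word.get(integer)  # None for unknown ids, like word_for_id()

print(word_for_id_fast(3))  # girl
```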

We will generate predictions for all photos in the test dataset and in the train dataset.

The function below named evaluate_model() will evaluate a trained model against a given dataset of photo descriptions and photo features. The actual and predicted descriptions are collected and evaluated collectively using the corpus BLEU score that summarizes how close the generated text is to the expected text.

We can put all of this together with the functions from the previous section for loading the data. We first need to load the training dataset in order to prepare a Tokenizer so that we can encode generated words as input sequences for the model. It is critical that we encode the generated words using exactly the same encoding scheme as was used when training the model.

We can see that the scores fit within and close to the top of the expected range of a skillful model on the problem. The chosen model configuration is by no means optimized.


BLEU-1: 0.579114

BLEU-2: 0.344856

BLEU-3: 0.252154

BLEU-4: 0.131446

Generate New Captions

Now that we know how to develop and evaluate a caption generation model, how can we use it?

Almost everything we need to generate captions for entirely new photographs is in the model file.

We also need the Tokenizer for encoding generated words for the model while generating a sequence, and the maximum length of input sequences, used when we defined the model (e.g. 34).

We can hard code the maximum sequence length. As for the text encoding, we can create the tokenizer once and save it to a file, so that we can load it quickly whenever we need it without requiring the entire Flickr8K dataset. An alternative would be to save our own vocabulary file and word-to-integer mapping during training.
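If you would rather not hard code the value, the maximum length can be recomputed from the wrapped training descriptions with a small helper; the sketch below is illustrative, and the toy descriptions dictionary is made up.

```python
# calculate the length, in words, of the longest wrapped description
def max_length(descriptions):
    lines = [d for desc_list in descriptions.values() for d in desc_list]
    return max(len(line.split()) for line in lines)

# illustrative example: two photos with wrapped descriptions
descriptions = {
    'img1': ['startseq dog runs on the beach endseq'],
    'img2': ['startseq two children play endseq'],
}
print(max_length(descriptions))  # prints 7
```

Note that the count includes the startseq and endseq wrapper tokens, matching the value of 34 used when the model was defined.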

We can create the Tokenizer as before and save it as a pickle file tokenizer.pkl. The complete example is listed below.


from keras.preprocessing.text import Tokenizer
from pickle import dump

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load a pre-defined list of photo identifiers
def load_set(filename):
    doc = load_doc(filename)
    dataset = list()
    # process line by line
    for line in doc.split('\n'):
        # skip empty lines
        if len(line) < 1:
            continue
        # get the image identifier
        identifier = line.split('.')[0]
        dataset.append(identifier)
    return set(dataset)

# load clean descriptions into memory
def load_clean_descriptions(filename, dataset):
    # load document
    doc = load_doc(filename)
    descriptions = dict()
    for line in doc.split('\n'):
        # split line by white space
        tokens = line.split()
        # split id from description
        image_id, image_desc = tokens[0], tokens[1:]
        # skip images not in the set
        if image_id in dataset:
            # create list
            if image_id not in descriptions:
                descriptions[image_id] = list()
            # wrap description in tokens
            desc = 'startseq ' + ' '.join(image_desc) + ' endseq'
            # store
            descriptions[image_id].append(desc)
    return descriptions

# convert a dictionary of clean descriptions to a list of descriptions
def to_lines(descriptions):
    all_desc = list()
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# fit a tokenizer given caption descriptions
def create_tokenizer(descriptions):
    lines = to_lines(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

# load training dataset (6K)
filename = 'Flickr8k_text/Flickr_8k.trainImages.txt'
train = load_set(filename)
print('Dataset: %d' % len(train))
# descriptions
train_descriptions = load_clean_descriptions('descriptions.txt', train)
print('Descriptions: train=%d' % len(train_descriptions))
# prepare tokenizer
tokenizer = create_tokenizer(train_descriptions)
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

We can now load the tokenizer whenever we need it without having to load the entire training dataset of annotations.

Now, let’s generate a description for a new photograph.

Below is a new photograph that I chose randomly on Flickr (available under a permissive license).

Download the photograph and save it to your local directory with the filename “example.jpg“.

First, we must load the Tokenizer from tokenizer.pkl and define the maximum length of the sequence to generate, needed for padding inputs.


# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34

Then we must load the model, as before.


# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')

Next, we must load the photo we wish to describe and extract the features.

We could do this by re-defining the model to incorporate the VGG-16 model directly, or we can use the VGG model to predict the features and provide them as inputs to our existing model. We will do the latter and use a modified version of the extract_features() function used during data preparation, adapted to work on a single photo.


# extract features from a single photo
def extract_features(filename):
    # load the model
    model = VGG16()
    # re-structure the model to output the second-to-last layer
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # load the photo
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model (add a batch dimension of 1)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    return feature

# load and prepare the photograph
photo = extract_features('example.jpg')

We can then generate a description using the generate_desc() function defined when evaluating the model.

The complete example for generating a description for an entirely new standalone photograph is listed below.


from pickle import load
from numpy import argmax
from keras.preprocessing.sequence import pad_sequences
from keras.applications.vgg16 import VGG16
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.applications.vgg16 import preprocess_input
from keras.models import Model
from keras.models import load_model

# extract features from a single photo
def extract_features(filename):
    # load the model
    model = VGG16()
    # re-structure the model to output the second-to-last layer
    model.layers.pop()
    model = Model(inputs=model.inputs, outputs=model.layers[-1].output)
    # load the photo
    image = load_img(filename, target_size=(224, 224))
    # convert the image pixels to a numpy array
    image = img_to_array(image)
    # reshape data for the model (add a batch dimension of 1)
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    # prepare the image for the VGG model
    image = preprocess_input(image)
    # get features
    feature = model.predict(image, verbose=0)
    return feature

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'startseq'
    # iterate over the whole length of the sequence
    for i in range(max_length):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # stop if we predict the end of the sequence
        if word == 'endseq':
            break
    return in_text

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))
# pre-define the max sequence length (from training)
max_length = 34
# load the model
model = load_model('model-ep002-loss3.245-val_loss3.612.h5')
# load and prepare the photograph
photo = extract_features('example.jpg')
# generate description
description = generate_desc(model, tokenizer, photo, max_length)
print(description)

In this case, the description generated was as follows:


startseq dog is running across the beach endseq

You could remove the start and end tokens and you would have the basis for a nice automatic photo captioning model.
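A small illustrative helper for that cleanup step (clean_caption() is not part of the tutorial code) could look like this:

```python
def clean_caption(description):
    # drop the startseq/endseq wrapper tokens used during training
    words = description.split()
    words = [w for w in words if w not in ('startseq', 'endseq')]
    return ' '.join(words)

print(clean_caption('startseq dog is running across the beach endseq'))
# dog is running across the beach
```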

It’s like living in the future guys!

It still completely blows my mind that we can do this. Wow.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

Alternate Pre-Trained Photo Models. A small 16-layer VGG model was used for feature extraction. Consider exploring larger models that offer better performance on the ImageNet dataset, such as Inception.

Smaller Vocabulary. A larger vocabulary of nearly eight thousand words was used in the development of the model. Many of the words supported may be misspellings or only used once in the entire dataset. Refine the vocabulary and reduce the size, perhaps by half.

Pre-trained Word Vectors. The model learned the word vectors as part of fitting the model. Better performance may be achieved by using word vectors either pre-trained on the training dataset or trained on a much larger corpus of text, such as news articles or Wikipedia.

Tune Model. The configuration of the model was not tuned on the problem. Explore alternate configurations and see if you can achieve better performance.

Did you try any of these extensions? Share your results in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.
