Python Project - Learn to Build Image Caption Generator with CNN & LSTM

When you see an image, your brain can easily tell what it is about, but can a computer tell what an image represents? Computer vision researchers worked on this for a long time, and it was considered impossible until now! With the advancement of deep learning techniques, the availability of huge datasets, and greater computing power, we can build models that generate captions for an image.

This is what we are going to implement in this Python-based project, where we will use deep learning techniques of Convolutional Neural Networks and a type of Recurrent Neural Network (LSTM) together.

 

 

What is Image Caption Generator?

Image caption generation is a task that involves computer vision and natural language processing concepts to recognize the context of an image and describe it in a natural language like English.

Image Caption Generator with CNN – About the Python based Project

The objective of our project is to learn the concepts of CNN and LSTM models and to build a working image caption generator by combining a CNN with an LSTM.

In this Python project, we will implement the caption generator using a CNN (Convolutional Neural Network) and an LSTM (Long Short Term Memory network). The image features will be extracted from Xception, a CNN model trained on the ImageNet dataset, and then we will feed these features into the LSTM model, which will be responsible for generating the image captions.

The Dataset of Python based Project

For the image caption generator, we will be using the Flickr_8K dataset. There are other, bigger datasets like Flickr_30K and MSCOCO, but it can take weeks just to train the network on them, so we will be using the small Flickr_8K dataset. The advantage of a huge dataset is that we can build better models.

Thanks to Jason Brownlee for providing a direct link to download the dataset (Size: 1GB).

The Flickr_8k_text folder contains the file Flickr8k.token, the main file of our dataset: each line holds an image name and one of its captions separated by a tab, and the entries themselves are separated by newlines (“\n”).

Pre-requisites

This project requires good knowledge of deep learning, Python, working with Jupyter notebooks, the Keras library, NumPy, and natural language processing.

Make sure you have installed all the following necessary libraries:

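The original list was shown as an image. Based on the libraries imported later in this project, installing the following packages should cover the requirements (a suggested command; exact package versions may differ):

pip install tensorflow keras pillow numpy tqdm matplotlib jupyterlab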


What is CNN?

Convolutional Neural Networks are specialized deep neural networks that can process data with an input shape like a 2D matrix. Images are easily represented as a 2D matrix, which makes CNNs very useful for working with images.

A CNN is basically used for image classification: identifying whether an image is a bird, a plane or Superman, etc.

[Figure: working of a deep CNN]

It scans images from left to right and top to bottom to pull out important features from the image, and combines those features to classify images. It can handle images that have been translated, rotated, scaled, or viewed from a changed perspective.
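As a quick illustration (not part of the project code), a single Keras convolution layer scanning a 299x299 RGB image looks like this; the filter count and kernel size below are arbitrary example values:

from keras.layers import Conv2D, Input
from keras.models import Model

# a 299x299 RGB image as input, scanned by 32 filters of size 3x3
image_input = Input(shape=(299, 299, 3))
feature_maps = Conv2D(filters=32, kernel_size=(3, 3), activation='relu')(image_input)

print(Model(image_input, feature_maps).output_shape)   # (None, 297, 297, 32)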

 

What is LSTM?

LSTM stands for Long Short Term Memory; LSTMs are a type of RNN (recurrent neural network) well suited for sequence prediction problems. Based on the previous text, we can predict what the next word will be. LSTMs have proven more effective than traditional RNNs by overcoming their short-term memory limitation: an LSTM can carry relevant information throughout the processing of the inputs, and with a forget gate it discards non-relevant information.

This is what an LSTM cell looks like –

[Figure: structure of an LSTM cell]
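As a small sketch (not part of the project code), this is how an LSTM layer in Keras reads a padded sequence of word indices and returns a single summary vector; the sizes below mirror the ones used later in this project:

from keras.layers import Input, Embedding, LSTM
from keras.models import Model

# a caption padded to 32 tokens, each token mapped to a 256-dimensional embedding
sequence_input = Input(shape=(32,))
embedded = Embedding(input_dim=7577, output_dim=256, mask_zero=True)(sequence_input)
summary_vector = LSTM(256)(embedded)   # last hidden state after reading the whole sequence

print(Model(sequence_input, summary_vector).output_shape)   # (None, 256)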

Image Caption Generator Model

So, to make our image caption generator model, we will be merging these architectures. It is also called a CNN-RNN model.

[Figure: CNN-RNN model of the image caption generator]

Project File Structure

Downloaded from the dataset:

- Flicker8k_Dataset – folder containing the photographs
- Flickr_8k_text – folder containing the caption file (Flickr8k.token.txt) and the train/test image lists (e.g. Flickr_8k.trainImages.txt)

The files below will be created by us while making the project:

- training_caption_generator.ipynb – Jupyter notebook in which we train the model
- descriptions.txt – text file containing all the cleaned image captions
- features.p – pickle file containing the image feature vectors extracted from the Xception model
- tokenizer.p – pickle file containing the tokens mapped to index values
- model.png – visual representation of the model architecture
- models/ – folder containing the trained models saved after each epoch
- testing_caption_generator.py – Python file for generating a caption for any image

You can download all the files from the link:

Image Caption Generator – Python Project Files

[Figure: project file structure]

 

Building the Python based Project

Let’s start by initializing the Jupyter notebook server by typing jupyter lab in the console of your project folder. It will open up the interactive Python notebook where you can run your code. Create a Python3 notebook and name it training_caption_generator.ipynb.


1. First, we import all the necessary packages

import string
import numpy as np
from PIL import Image
import os
from pickle import dump, load

from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers.merge import add
from keras.models import Model, load_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout

# small library for seeing the progress of loops
from tqdm import tqdm_notebook as tqdm
tqdm().pandas()

2. Getting and performing data cleaning

The main text file that contains all image captions is Flickr8k.token.txt in our Flickr_8k_text folder.

Have a look at the file – each line holds an image name and one of its captions, separated by a tab (“\t”), and the entries are separated by newlines (“\n”).

Each image has 5 captions, and we can see that a number from #0 to #4 is assigned to each caption.
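The original screenshot of the file is not reproduced here; an illustrative (not verbatim) pair of entries would look like this:

111537222_07e56d5a30.jpg#0	A man is standing on top of a large rock .
111537222_07e56d5a30.jpg#1	A climber poses on a rocky outcrop .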

We will define 5 functions:

- load_doc( filename ) – loads the document file and reads its contents into a string
- all_img_captions( filename ) – creates a descriptions dictionary that maps each image to a list of its 5 captions
- cleaning_text( descriptions ) – cleans all the captions: lower casing, removing punctuation and removing words containing numbers
- text_vocabulary( descriptions ) – builds a vocabulary set of all the unique words from the descriptions
- save_descriptions( descriptions, filename ) – stores all the cleaned descriptions in a single file, descriptions.txt

Code:

# Loading a text file into memory
def load_doc(filename):
    # Opening the file as read only
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# get all imgs with their captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions

# Data cleaning - lower casing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            img_caption = img_caption.replace("-", " ")
            desc = img_caption.split()
            # converts to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            # remove hanging 's and a
            desc = [word for word in desc if(len(word) > 1)]
            # remove tokens with numbers in them
            desc = [word for word in desc if(word.isalpha())]
            # convert back to string
            img_caption = ' '.join(desc)
            captions[img][i] = img_caption
    return captions

def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        [vocab.update(d.split()) for d in descriptions[key]]
    return vocab

# All descriptions in one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    file = open(filename, "w")
    file.write(data)
    file.close()

# Set these paths according to the project folder on your system
# (raw strings avoid problems with backslashes in Windows paths)
dataset_text = r"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset"

# we prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# loading the file that contains all data
# mapping them into descriptions dictionary: img to 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =", len(descriptions))

# cleaning the descriptions
clean_descriptions = cleaning_text(descriptions)

# building vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary =", len(vocabulary))

# saving each description to file
save_descriptions(clean_descriptions, "descriptions.txt")
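As an optional sanity check (not in the original code), you can print one cleaned entry to confirm the dictionary structure of image name to a list of 5 cleaned captions:

# pick any image key and look at its cleaned captions
sample_key = list(clean_descriptions.keys())[0]
print(sample_key)
print(clean_descriptions[sample_key])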

3. Extracting the feature vector from all images 

This technique is called transfer learning: we don’t have to train everything on our own; instead, we use a pre-trained model that has already been trained on a large dataset, extract the features from it, and use them for our task. We are using the Xception model, which has been trained on the ImageNet dataset with 1000 different classes to classify. We can directly import this model from keras.applications. Make sure you are connected to the internet, as the weights are downloaded automatically. Since the Xception model was originally built for ImageNet, we will make small changes to integrate it with our model. One thing to notice is that the Xception model takes a 299*299*3 image size as input. We will remove the last classification layer and get the 2048-dimensional feature vector.

model = Xception( include_top=False, pooling='avg' )
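As a quick optional check (not in the original article), you can confirm that this configuration yields a 2048-dimensional feature vector per image:

model = Xception(include_top=False, pooling='avg')
print(model.output_shape)   # (None, 2048)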

The function extract_features() will extract features for all images and we will map image names with their respective feature array. Then we will dump the features dictionary into a “features.p” pickle file.

Code:

def extract_features(directory):
    model = Xception( include_top=False, pooling='avg' )
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = directory + "/" + img
        image = Image.open(filename)
        image = image.resize((299,299))
        image = np.expand_dims(image, axis=0)
        # image = preprocess_input(image)
        # scale pixel values to the [-1, 1] range expected by Xception
        image = image/127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features

# 2048 feature vector
features = extract_features(dataset_images)
dump(features, open("features.p","wb"))


This process can take a lot of time depending on your system. I am using an Nvidia 1050 GPU for training, so it took me around 7 minutes to perform this task. However, if you are using a CPU, this process might take 1-2 hours. You can comment out the code above and directly load the features from our pickle file:

features = load(open("features.p","rb"))

4. Loading dataset for Training the model

In our Flickr_8k_text folder, we have the Flickr_8k.trainImages.txt file, which contains a list of 6000 image names that we will use for training.

For loading the training dataset, we need more functions:

Code:

# load the data
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    return photos

def load_clean_descriptions(filename, photos):
    # loading clean_descriptions
    file = load_doc(filename)
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions

def load_features(photos):
    # loading all features
    all_features = load(open("features.p","rb"))
    # selecting only needed features
    features = {k:all_features[k] for k in photos}
    return features

filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"

# train = loading_data(filename)
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)

5. Tokenizing the vocabulary 

Computers don’t understand English words; for a computer, we have to represent words with numbers. So, we will map each word of the vocabulary to a unique index value. The Keras library provides us with the Tokenizer class, which we will use to create tokens from our vocabulary and save them to a “tokenizer.p” pickle file.

Code:

# converting dictionary to clean list of descriptions
def dict_to_list(descriptions):
    all_desc = []
    for key in descriptions.keys():
        [all_desc.append(d) for d in descriptions[key]]
    return all_desc

# creating tokenizer class
# this will vectorise text corpus
# each integer will represent token in dictionary
from keras.preprocessing.text import Tokenizer

def create_tokenizer(descriptions):
    desc_list = dict_to_list(descriptions)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(desc_list)
    return tokenizer

# give each word an index, and store that into tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))
vocab_size = len(tokenizer.word_index) + 1
vocab_size

Our vocabulary contains 7577 words.

We also calculate the maximum length of the descriptions. This is important for deciding the model structure parameters. The max_length of a description is 32.

# calculate maximum length of descriptions
def max_length(descriptions):
    desc_list = dict_to_list(descriptions)
    return max(len(d.split()) for d in desc_list)

max_length = max_length(descriptions)
max_length

6. Create Data generator

Let us first see what the input and output of our model will look like. To turn this into a supervised learning task, we have to provide input and output pairs to the model for training. We have to train our model on 6000 images, where each image is represented by a 2048-length feature vector and the caption is also encoded as numbers. It is not possible to hold this amount of data for 6000 images in memory, so we will be using a generator method that yields batches.

The generator will yield the input and output sequence.

For example:

The input to our model is [x1, x2] and the output is y, where x1 is the 2048-length feature vector of the image, x2 is the input text sequence, and y is the next word that the model has to predict.

x1 (feature vector)   x2 (text sequence)                 y (word to predict)
feature               start                              two
feature               start, two                         dogs
feature               start, two, dogs                   drink
feature               start, two, dogs, drink            water
feature               start, two, dogs, drink, water     end
# create input-output sequence pairs from the image description.

# data generator, used by model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length):
    while 1:
        for key, description_list in descriptions.items():
            # retrieve photo features
            feature = features[key][0]
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)
            yield [[input_image, input_sequence], output_word]

def create_sequences(tokenizer, max_length, desc_list, feature):
    X1, X2, y = list(), list(), list()
    # walk through each description for the image
    for desc in desc_list:
        # encode the sequence
        seq = tokenizer.texts_to_sequences([desc])[0]
        # split one sequence into multiple X,y pairs
        for i in range(1, len(seq)):
            # split into input and output pair
            in_seq, out_seq = seq[:i], seq[i]
            # pad input sequence
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
            # encode output sequence
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # store
            X1.append(feature)
            X2.append(in_seq)
            y.append(out_seq)
    return np.array(X1), np.array(X2), np.array(y)

# You can check the shape of the input and output for your model
[a,b],c = next(data_generator(train_descriptions, features, tokenizer, max_length))
a.shape, b.shape, c.shape
# ((47, 2048), (47, 32), (47, 7577))

7. Defining the CNN-RNN model

To define the structure of the model, we will be using the Keras Model from the Functional API. It will consist of three major parts:

- Feature Extractor – the feature extracted from the image has a size of 2048; with a dense layer, we reduce the dimensions to 256 nodes.
- Sequence Processor – an embedding layer handles the textual input, followed by an LSTM layer.
- Decoder – we merge the output of the above two layers and process it with a dense layer to make the final prediction; the final layer has as many nodes as our vocabulary size.

A visual representation of the final model is given below –

[Figure: visual representation of the final model]

from keras.utils import plot_model

# define the captioning model
def define_model(vocab_size, max_length):
    # features from the CNN model squeezed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)

    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # Merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)

    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    # summarize model
    print(model.summary())
    # plot_model requires the pydot and graphviz packages to be installed
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

8. Training the model

To train the model, we will be using the 6000 training images, generating the input and output sequences in batches and fitting them to the model with the model.fit_generator() method. We also save the model to our models folder after every epoch. This will take some time depending on your system capability.

# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)

model = define_model(vocab_size, max_length)
epochs = 10
steps = len(train_descriptions)

# making a directory "models" to save our models
os.mkdir("models")  # raises an error if the folder already exists

for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")

9. Testing the model

Now that the model has been trained, we will make a separate file, testing_caption_generator.py, which will load the model and generate predictions. The predictions are sequences of index values, so we will use the same tokenizer.p pickle file to map the index values back to words.

Code:

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import argparse
# this is a separate script, so it needs its own imports
from pickle import load
from keras.models import load_model
from keras.applications.xception import Xception
from keras.preprocessing.sequence import pad_sequences

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']

def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except:
        print("ERROR: Couldn't open image! Make sure the image path and extension is correct")
    image = image.resize((299,299))
    image = np.array(image)
    # for images that have 4 channels, we convert them into 3 channels
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    image = image/127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature

def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo, sequence], verbose=0)
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text

# path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
max_length = 32
tokenizer = load(open("tokenizer.p","rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")

photo = extract_features(img_path, xception_model)
img = Image.open(img_path)

description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
Results:

[Result: man standing on rock]

[Result: girls playing]

[Result: man on kayak]

Summary

In this advanced Python project, we implemented a CNN-RNN model by building an image caption generator. A key point to note is that our model depends on the data, so it cannot predict words that are outside its vocabulary. We used a small dataset of 8000 images. For production-level models, we need to train on datasets larger than 100,000 images, which can produce more accurate models.