One-Shot Image Verification using deep learning

Since the Olivetti face database is not large enough to train the network well, and the architecture of this network is really optimized for character recognition, it probably makes more sense to use a pretrained face recognition network to encode the images. The encodings of the two images will then be compared, and the final stage will be trained to complete the discrimination. Weights for the final layer can be loaded into the complete model after training that layer alone.
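A minimal sketch of that last step, assuming Keras and a placeholder layer name ('final_dense') shared by the stand-alone head and the complete model:

```python
# Copy the weights of the separately trained final layer into the full model.
# 'head_model' and 'siamese_model' are hypothetical names for this sketch.
trained_weights = head_model.get_layer('final_dense').get_weights()
siamese_model.get_layer('final_dense').set_weights(trained_weights)
```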

The best choice for a pretrained face recognition network is probably the VGG-Face network, described in this paper - http://www.robots.ox.ac.uk/~vgg/publications/2015/Parkhi15/parkhi15.pdf - which has already been converted to Keras here - https://gist.github.com/EncodeTS/6bbe8cb8bebad7a672f0d872561782d9.

One difficulty may be the differing formats of the VGG-Face dataset and the Olivetti set. To accommodate the different format and grayscale, the Olivetti images will be adjusted to match the expected input, with a similar placement of the face in the image and the grayscale values repeated across all three RGB planes.
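Something like the following should work, assuming the images come from scikit-learn's fetch_olivetti_faces (64x64 grayscale in [0, 1]) and that the network expects 224x224 RGB input:

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from skimage.transform import resize

faces = fetch_olivetti_faces()            # 400 grayscale images, 64x64
TARGET = (224, 224)                       # assumed VGG-Face input size

def to_rgb(img):
    """Resize a grayscale face and repeat it on all three RGB planes."""
    img = resize(img, TARGET, mode='reflect')
    return np.stack([img, img, img], axis=-1)

rgb_images = np.array([to_rgb(f) for f in faces.images])  # (400, 224, 224, 3)
```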

The VGG-Face-based Siamese model will be fine-tuned on the first 30 of the Olivetti subjects, and the remaining 10 will be used to test one-shot accuracy.

While I was trying to determine how faces should be framed in the image, I found this paper, which describes adapting the VGG-Face model for cross-over learning - http://cs231n.stanford.edu/reports/2016/pdfs/006_Report.pdf

It seems that the Olivetti faces are reasonably well cropped, so we will only need to convert them to RGB format for the input.

One thing I did not understand at the outset was that the model does not actually learn anything from the one image. All features need to be encoded in the network to begin with, so it is critically important that a wide range of faces be present in the training data. For this reason the VGG-Face dataset is probably not sufficient for the wider population, as it consists mostly of celebrity faces.

I was looking for how to merge the results from the branches and found this blog post on how someone implemented the model from the paper - https://sorenbouma.github.io/blog/oneshot/. Some of the following code is adapted from that.
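Here is roughly how the merge looks, following that post's L1-distance approach; 'encoder' is a placeholder for the VGG-Face branch:

```python
from keras.models import Model
from keras.layers import Input, Lambda, Dense
import keras.backend as K

def build_siamese(encoder, input_shape):
    """Merge the two branches with a component-wise L1 distance and
    score same/different with a single sigmoid unit."""
    left, right = Input(shape=input_shape), Input(shape=input_shape)
    # Shared weights: the same encoder instance processes both images
    encoded_l, encoded_r = encoder(left), encoder(right)
    distance = Lambda(lambda t: K.abs(t[0] - t[1]))([encoded_l, encoded_r])
    prediction = Dense(1, activation='sigmoid')(distance)
    return Model(inputs=[left, right], outputs=prediction)
```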

If we use 75% of the Olivetti Faces data set for training (30 subjects with 10 images each, 300 images in all), the number of positive pairs we can get is (10 × 9 / 2) × 30 = 45 × 30 = 1,350, and the number of negative pairs is 300 × 290 / 2 = 43,500.

We will draw evenly and randomly from those sets for training batches and run until we stop seeing much additional improvement. For validation we will have 450 positive pairs and 4,500 negative pairs. The validation data will be used to assess the training of the final layers and to construct the one-shot tests.
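A sketch of the balanced batch sampling (illustrative, not the exact training code):

```python
import numpy as np

def sample_batch(images, labels, batch_size=32, rng=np.random):
    """Draw a batch with equal numbers of positive and negative pairs."""
    pairs = [np.zeros((batch_size,) + images.shape[1:]) for _ in range(2)]
    targets = np.zeros(batch_size)
    targets[:batch_size // 2] = 1                 # first half: same subject
    for i in range(batch_size):
        a = rng.randint(len(images))
        if i < batch_size // 2:
            candidates = np.where(labels == labels[a])[0]
            candidates = candidates[candidates != a]  # exclude the image itself
        else:
            candidates = np.where(labels != labels[a])[0]
        b = rng.choice(candidates)
        pairs[0][i], pairs[1][i] = images[a], images[b]
    return pairs, targets
```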

It seems that for training the final layer it may be faster to precompute the output of the VGG-Face branches for all the images and train the final layer on its own. I will see if I can figure out how to do that once I have things set up.
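One way to do it, assuming the encoder and the L1-distance merge above: run the branch once per image, then feed absolute differences of stored encodings to a small stand-alone head:

```python
from keras.models import Sequential
from keras.layers import Dense

# One forward pass per image instead of two per training pair
encodings = encoder.predict(rgb_images, batch_size=32)

head = Sequential([
    Dense(1, activation='sigmoid', input_shape=(encodings.shape[1],))
])
head.compile(optimizer='adam', loss='binary_crossentropy',
             metrics=['accuracy'])
# Training inputs become np.abs(encodings[i] - encodings[j]) with labels 0/1.
```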

If the performance is not so good with only the final fully connected layer trained, we could also make the second-to-last layer trainable and include it in the discriminator, with initial weights starting where they were for VGG-Face.
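A sketch of what that would look like; the layer index here is a guess and would need to be checked against the actual model:

```python
# Unfreeze only the last layer of the encoder branch, keeping its
# VGG-Face weights as the starting point.
for layer in encoder.layers:
    layer.trainable = False
encoder.layers[-1].trainable = True

# Recompile so the change in trainability takes effect
siamese.compile(optimizer='adam', loss='binary_crossentropy')
```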

To multiply the amount of data we have, we might also apply transformations to the encoding vectors used to train the final layer rather than to the original images - perhaps a dropout of 10% of the components. If we were to start from the original images, we could add noise or perform affine distortions that approximate changes in camera angle or lighting without creating abnormal facial shapes.
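Both ideas in sketch form; the 10% rate and the distortion parameters are guesses:

```python
import numpy as np
from keras.preprocessing.image import ImageDataGenerator

def dropout_encoding(enc, rate=0.10, rng=np.random):
    """Zero a random ~10% of an encoding's components as cheap augmentation."""
    mask = rng.rand(*enc.shape) >= rate
    return enc * mask

# For image-level augmentation: small rotations and shifts approximate
# changes in camera angle without creating abnormal facial shapes.
augmenter = ImageDataGenerator(rotation_range=10,
                               width_shift_range=0.05,
                               height_shift_range=0.05,
                               zoom_range=0.05)
```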

Improved source for VGG-Face

After mucking around with VGG-Face trying to get it to work, I think I found a better source for a Keras-based VGG-Face model ready for fine-tuning here - https://github.com/rcmalli/keras-vggface. It has the weights in TensorFlow format.
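Loading it as a feature-extraction branch looks something like this, based on that repository's README:

```python
from keras_vggface.vggface import VGGFace

# Convolutional base only, with average pooling to get a flat encoding
encoder = VGGFace(include_top=False, input_shape=(224, 224, 3), pooling='avg')
encoder.trainable = False
```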

AWS

Although it was not really necessary given the small amount of data and the single trainable layer, training was performed on an AWS p2 instance. If we had progressed to training more of the fully connected layers, the GPU machine might have been a help.

ROC Curve

Not too bad for only 40 epochs of training on the final layer. This is an indication of performance on a 1-of-2 one-shot learning task, so not really very good, but let's go on to check the one-shot performance for larger sets.
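A sketch of how such a curve can be computed with scikit-learn ('scores' and 'y_true' are placeholder names for the model's similarity predictions on the validation pairs and the same/different labels):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

fpr, tpr, _ = roc_curve(y_true, scores)
plt.plot(fpr, tpr, label='AUC = %.3f' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], '--')          # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```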

So, accuracy drops off pretty quickly as the number of candidates in the one-shot task grows.
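The N-way one-shot test presumably works along these lines (an illustrative sketch, not the exact evaluation code):

```python
import numpy as np

def one_shot_trial(model, images, labels, n, rng=np.random):
    """One N-way trial: does the test image best match the true candidate?"""
    subjects = rng.choice(np.unique(labels), size=n, replace=False)
    target = subjects[0]
    test_idx, ref_idx = rng.choice(np.where(labels == target)[0], 2,
                                   replace=False)
    candidates = [ref_idx] + [rng.choice(np.where(labels == s)[0])
                              for s in subjects[1:]]
    test = np.repeat(images[test_idx][None], n, axis=0)
    scores = model.predict([test, images[candidates]])
    return np.argmax(scores) == 0       # true match is candidate 0

# e.g. np.mean([one_shot_trial(siamese, val_images, val_labels, n=5)
#               for _ in range(200)])
```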

Summary Notes:

What an awesome experience this was.

Steps for improvement