Project Goal
This project uses the Labeled Faces in the Wild (LFW) dataset, and the goal is to train variants of deep architectures to learn whether a pair of face images shows the same person or not. It is a PyTorch implementation of a Siamese network with a 19-layer base CNN.
Method
Consider a Convolutional Neural Network (CNN), denoted by C, that takes as input a single image I and outputs a feature vector f, where f is simply the output of the final fully connected layer that contains N nodes (and hence, N numbers are produced). In other words, f = C(I). A Siamese network is a CNN that takes two separate image inputs, I1 and I2, and both images go through the same exact CNN C (this is what's called "shared weights"), so we can say: f1 = C(I1) and f2 = C(I2) (NOTE that we use the same C here, not two different C's! Only one network — this is key!). We can process the images in one pass, and then we can "play games" with the output features to learn what it means for two images (of faces), represented by their respective "latent embeddings" f, to be the same person/identity or not. The base architecture we will use throughout, called C, is listed below.
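First, to make the "shared weights" idea concrete, here is a minimal PyTorch sketch (the function name `siamese_forward` and its arguments are illustrative, not part of this repo; `C` can be any feature-extracting module, such as the network defined in the next section):

```python
import torch
import torch.nn as nn

def siamese_forward(C: nn.Module, img1: torch.Tensor, img2: torch.Tensor):
    """Run both images through the *same* network C (shared weights)."""
    f1 = C(img1)  # f1 = C(I1)
    f2 = C(img2)  # f2 = C(I2), computed with the same parameters as above
    return f1, f2
```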
Network Architecture
1. Convolution Layer
- Input: image of size 128x128x3 (3 for RGB color channels)
- Parameters: 64 features of size 5x5, stride of (1,1), padding=2
- Output: tensor of size 128x128x64
2. ReLU (in-place)
3. Batch Normalization Layer
64 features
4. Max pooling layer
- Input: tensor of size 128x128x64
- Parameters: none; perform 2x2 max-pooling with (2,2) stride
- Output: tensor of size 64x64x64
5. Convolution Layer
- Input: tensor of size 64x64x64
- Parameters: 128 features of size 5x5, stride of (1,1), padding=2
- Output: tensor of size 64x64x128
6. ReLU (in-place)
7. Batch Normalization Layer
128 features
8. Max pooling layer
- Input: tensor of size 64x64x128
- Parameters: none; perform 2x2 max-pooling with (2,2) stride
- Output: tensor of size 32x32x128
9. Convolution Layer
- Input: tensor of size 32x32x128
- Parameters: 256 features of size 3x3, stride of (1,1), padding=1
- Output: tensor of size 32x32x256
10. ReLU (in-place)
11. Batch Normalization Layer
256 features
12. Max pooling layer
- Input: tensor of size 32x32x256
- Parameters: none; perform 2x2 max-pooling with (2,2) stride
- Output: tensor of size 16x16x256
13. Convolution Layer
- Input: tensor of size 16x16x256
- Parameters: 512 features of size 3x3, stride of (1,1), padding=1
- Output: tensor of size 16x16x512
14. ReLU (in-place)
15. Batch Normalization Layer
512 features
16. Flatten Layer
- Input: tensor of size 16x16x512
- Parameters: none; simply flatten the tensor into 1-D
- Output: vector of size 16x16x512 = 131072
Note: in older versions of PyTorch this simple layer doesn't exist as a module; you'll have to use view(), or implement it yourself (recent PyTorch versions also provide nn.Flatten).
17. Fully Connected Layer (aka Linear Layer)
- Input: vector of size 131072
- Parameters: fully-connected layer with 1024 nodes
- Output: vector of size 1024
18. ReLU (in-place)
19. Batch Normalization Layer
1024 features
For this network C, a feature f is produced as the output of layer number 19, of size 1024.
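For reference, here is one way C could be written in PyTorch, following the layer list above (a sketch only; the class name `BaseNetwork` and the split into `conv`/`fc` blocks are my own choices, not necessarily how the repo's code is organized):

```python
import torch
import torch.nn as nn

class BaseNetwork(nn.Module):
    """Sketch of the 19-layer feature extractor C (illustrative names)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            # Layers 1-4: 128x128x3 -> 128x128x64 -> 64x64x64
            nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(64),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Layers 5-8: -> 64x64x128 -> 32x32x128
            nn.Conv2d(64, 128, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(128),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Layers 9-12: -> 32x32x256 -> 16x16x256
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(256),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Layers 13-15: -> 16x16x512
            nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(512),
        )
        self.fc = nn.Sequential(
            # Layers 17-19: 131072 -> 1024
            nn.Linear(16 * 16 * 512, 1024),
            nn.ReLU(inplace=True),
            nn.BatchNorm1d(1024),
        )

    def forward(self, x):
        x = self.conv(x)            # (B, 512, 16, 16)
        x = x.view(x.size(0), -1)   # Layer 16: flatten to (B, 131072)
        return self.fc(x)           # feature f of size 1024

# Quick shape check:
# f = BaseNetwork()(torch.randn(4, 3, 128, 128))  # f.shape == (4, 1024)
```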
Loss Function
a. Binary-classification with BCE: A train and a test split are provided as text files, with each row containing the file paths of two images followed by a 1 if they are the same person and a 0 if not. I will read in these images and train the network as follows:
Given the network topology described above, construct a Siamese network that takes a pair of images as input to this shared network. Each image from the pair is run through C individually, producing f1 = C(I1) and f2 = C(I2). Combine f1 and f2 through concatenation to produce f = [f1, f2] (e.g., take the elements of f1 and then append the elements of f2 to the end, so that f has dimension 2*1024 = 2048). Then take f and add one more fully-connected layer to classify the pair as 0 or 1. This output layer will have 1 node (e.g., only one number is output) and should have a sigmoid activation (so that the outputs fall in the range [0.0, 1.0]). The output of this last layer then represents the probability that this pair of images shows the same person (it'll be close to 1 if they are the same person and close to 0 if not).
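A sketch of this binary-classification variant is below (the class name `SiameseBCE` and the variable names are illustrative; `BaseNetwork` refers to the sketch of C above):

```python
import torch
import torch.nn as nn

class SiameseBCE(nn.Module):
    """Sketch of the BCE variant: shared C, concatenated features, 1-node sigmoid output."""

    def __init__(self, base: nn.Module):
        super().__init__()
        self.base = base                          # the shared network C
        self.classifier = nn.Linear(2 * 1024, 1)  # one output node

    def forward(self, img1, img2):
        f1 = self.base(img1)                  # (B, 1024)
        f2 = self.base(img2)                  # (B, 1024), same weights as f1
        f = torch.cat([f1, f2], dim=1)        # (B, 2048)
        return torch.sigmoid(self.classifier(f)).squeeze(1)  # probability of "same"

# Training-step sketch with binary cross-entropy:
# model = SiameseBCE(BaseNetwork())
# p = model(img1, img2)                    # img1/img2: (B, 3, 128, 128)
# loss = nn.BCELoss()(p, labels.float())   # labels: 1 = same person, 0 = different
```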
b. Contrastive-loss: Now we will approach the problem in a slightly different way. Rather than attempt to directly classify the pair as 0 or 1, we will train the network to learn an embedding that brings features of the same person closer together and pushes apart features of different identities (e.g., the features representing all images of, say, "Brad Pitt" should be very close to each other, but far away from the feature clusters of all other people). The way we do this is to have the network learn a similarity metric. It's reasonable to assume that the L2-distance between features can capture this (as in K-Means), so we train the network to optimize an objective that yields an embedding which groups features by L2-distance. In the end, we train with what's called contrastive loss, a custom loss function defined as follows. Suppose we define:
D = ||f1 - f2||_2 = ||C(I1) - C(I2)||_2 (e.g., the Euclidean distance between features produced by C, given the pair of input images). We then define our custom objective function L as:

L(I1, I2, y) = y * D^2 + (1 - y) * max(0, m - D)^2

where I1 and I2 are the input image pair, y is the ground truth label (1 if they are the same and 0 if they are different), and m is a margin term that nudges the network to separate features even more. You can start with a margin of 1.0, but feel free to use other margins. The network is trained to minimize this term. Use the same base network as before. The only difference here is that instead of concatenating features and doing binary classification, we are training to enforce structure on the data (think of it as supervised clustering). Use the same train/test split as before.
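A sketch of this loss in PyTorch (the function and variable names are illustrative; `F.pairwise_distance` computes the Euclidean distance D between the two feature batches):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, y, margin=1.0):
    """Sketch of the contrastive loss above: y = 1 for same person, 0 for different."""
    d = F.pairwise_distance(f1, f2)  # Euclidean distance D, shape (B,)
    loss = y * d.pow(2) + (1 - y) * torch.clamp(margin - d, min=0).pow(2)
    return loss.mean()

# Usage sketch (base is the shared network C):
# f1, f2 = base(img1), base(img2)
# loss = contrastive_loss(f1, f2, labels.float(), margin=1.0)
```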
Where is the code?
As always, check out my GitHub: