# PyTorch Implementation: Siamese Network

## Project Goal

This project uses the Labeled Faces in the Wild (LFW) dataset to train variants of a deep architecture that learn whether a pair of face images shows the same person. It is a PyTorch implementation of a Siamese network built on a 19-layer base network.

## Method

Consider a Convolutional Neural Network (CNN), denoted by $C$, that takes a single image $I$ as input and outputs a feature vector $f \in \mathbb{R}^N$, where $f$ is the output of the final fully connected layer with $N$ nodes (hence $N$ numbers are produced). In other words, $C(I) = f \in \mathbb{R}^N$.

A Siamese network takes two separate image inputs, $I_1$ and $I_2$, and passes both through the exact same CNN $C$ (this is what is called "shared weights"), so that $C(I_1) = f_1$ and $C(I_2) = f_2$. Note that the same $C$ is used for both images, not two different networks. There is only one network, and this is key. We can process both images in one pass and then "play games" with the output features to learn what it means for two face images, represented by their respective latent embeddings $f_1$ and $f_2$, to show the same person or not.

The base architecture used throughout, which we call $C$, is described in the next section.

## Network Architecture

1. Convolution Layer

   - Input: image of size 128x128x3 (3 for RGB color channels)
   - Parameters: 64 features of size 5x5, stride of (1,1), padding=2
   - Output: tensor of size 128x128x64

2. ReLU (in-place)

3. Batch Normalization Layer

   - 64 features

4. Max pooling layer

   - Input: tensor of size 128x128x64
   - Parameters: none, perform 2x2 max-pooling with (2,2) stride
   - Output: tensor of size 64x64x64

5. Convolution Layer

   - Input: tensor of size 64x64x64
   - Parameters: 128 features of size 5x5, stride of (1,1), padding=2
   - Output: tensor of size 64x64x128

6. ReLU (in-place)

7. Batch Normalization Layer

   - 128 features

8. Max pooling layer

   - Input: tensor of size 64x64x128
   - Parameters: none, perform 2x2 max-pooling with (2,2) stride
   - Output: tensor of size 32x32x128

9. Convolution Layer

   - Input: tensor of size 32x32x128
   - Parameters: 256 features of size 3x3, stride of (1,1), padding=1
   - Output: tensor of size 32x32x256

10. ReLU (in-place)

11. Batch Normalization Layer

    - 256 features

12. Max pooling layer

    - Input: tensor of size 32x32x256
    - Parameters: none, perform 2x2 max-pooling with (2,2) stride
    - Output: tensor of size 16x16x256

13. Convolution Layer

    - Input: tensor of size 16x16x256
    - Parameters: 512 features of size 3x3, stride of (1,1), padding=1
    - Output: tensor of size 16x16x512

14. ReLU (in-place)

15. Batch Normalization Layer

    - 512 features

16. Flatten Layer

    - Input: tensor of size 16x16x512
    - Parameters: none, simply flatten the tensor into 1-D
    - Output: vector of size 16x16x512=131072

Note: older PyTorch releases had no built-in flatten layer, so you had to use `view()` or implement it yourself. Recent releases provide `nn.Flatten`.

17. Fully Connected Layer (aka Linear Layer)

    - Input: vector of size 131072
    - Parameters: fully-connected layer with 1024 nodes
    - Output: vector of size 1024

18. ReLU (in-place)

19. Batch Normalization Layer

    - 1024 features


For this network C, a feature f is produced as the output of layer number 19, of size 1024.
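
The 19 layers listed above can be sketched as a single `nn.Sequential`. This is a minimal illustration, not necessarily how the repository organizes its code: the function name `build_base_network` and its `image_size` and `embedding_dim` parameters are assumptions introduced here, and `nn.Flatten` is used for layer 16 (on older PyTorch releases, `x.view(x.size(0), -1)` inside a custom module does the same job).

```python
import torch
from torch import nn


def build_base_network(image_size: int = 128, embedding_dim: int = 1024) -> nn.Sequential:
    """Sketch of the base network C, in the Conv -> ReLU -> BatchNorm order listed above."""
    # Three 2x2 max-pools halve the spatial size three times (128 -> 64 -> 32 -> 16).
    final_size = image_size // 8
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=5, stride=1, padding=2),     # layer 1
        nn.ReLU(inplace=True),                                    # layer 2
        nn.BatchNorm2d(64),                                       # layer 3
        nn.MaxPool2d(kernel_size=2, stride=2),                    # layer 4
        nn.Conv2d(64, 128, kernel_size=5, stride=1, padding=2),   # layer 5
        nn.ReLU(inplace=True),                                    # layer 6
        nn.BatchNorm2d(128),                                      # layer 7
        nn.MaxPool2d(kernel_size=2, stride=2),                    # layer 8
        nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),  # layer 9
        nn.ReLU(inplace=True),                                    # layer 10
        nn.BatchNorm2d(256),                                      # layer 11
        nn.MaxPool2d(kernel_size=2, stride=2),                    # layer 12
        nn.Conv2d(256, 512, kernel_size=3, stride=1, padding=1),  # layer 13
        nn.ReLU(inplace=True),                                    # layer 14
        nn.BatchNorm2d(512),                                      # layer 15
        nn.Flatten(),                                             # layer 16
        nn.Linear(512 * final_size * final_size, embedding_dim),  # layer 17
        nn.ReLU(inplace=True),                                    # layer 18
        nn.BatchNorm1d(embedding_dim),                            # layer 19
    )
```

With the defaults, `build_base_network()(torch.randn(2, 3, 128, 128))` should produce a tensor of shape `(2, 1024)`. Note that PyTorch expects channels-first input (`batch x 3 x 128 x 128`), and that the fully connected layer alone holds 131072 x 1024, roughly 134 million, weights. In training mode the batch normalization layers need more than one sample per batch.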

## Loss Function

a. Binary classification with BCE: Train and test splits are each provided as a text file whose rows contain the file paths of two images, followed by 1 if they show the same person and 0 if not. I will read in these images and train the network as follows:

Given the network topology described above, construct a Siamese network that feeds a pair of images into this shared network. Each image of the pair is run through $C$ individually, producing $f_1$ and $f_2$. Combine $f_1$ and $f_2$ through concatenation to produce $f_{12}$ (i.e., append the elements of $f_2$ to the end of $f_1$, so that $f_{12}$ has 2*1024=2048 elements). Then feed $f_{12}$ into one more fully-connected layer to classify the pair as 0 or 1. This output layer has 1 node (only one number is output) and a sigmoid activation, so that the output is squashed into the range [0.0, 1.0]. The output of this last layer then represents the probability that the two images show the same person (it will be close to 1 if they do and close to 0 if not).
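
The concatenation head described above might look like the following sketch. The class name `SiameseBCE` and the tiny `stand_in_c` network are hypothetical, introduced only so the example runs quickly; in practice the real 19-layer base network would be passed in with `embedding_dim=1024`.

```python
import torch
from torch import nn


class SiameseBCE(nn.Module):
    """Shared embedding network followed by a one-node sigmoid classifier."""

    def __init__(self, embedding_net: nn.Module, embedding_dim: int = 1024):
        super().__init__()
        self.embedding_net = embedding_net           # the single shared network C
        self.classifier = nn.Linear(2 * embedding_dim, 1)

    def forward(self, img1: torch.Tensor, img2: torch.Tensor) -> torch.Tensor:
        f1 = self.embedding_net(img1)                # same weights for both inputs
        f2 = self.embedding_net(img2)
        f12 = torch.cat([f1, f2], dim=1)             # shape (batch, 2 * embedding_dim)
        return torch.sigmoid(self.classifier(f12)).squeeze(1)


# Hypothetical tiny stand-in for C (assumption, not the real base network).
stand_in_c = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))
model = SiameseBCE(stand_in_c, embedding_dim=16)

img1, img2 = torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])          # 1 = same person, 0 = different

probs = model(img1, img2)                            # one probability per pair
loss = nn.BCELoss()(probs, labels)
loss.backward()                                      # gradients flow into the shared weights
```

Because `embedding_net` is one module called twice, both images update the same parameters. A common alternative is to drop the explicit sigmoid and use `nn.BCEWithLogitsLoss`, which is numerically more stable; the version here follows the sigmoid-plus-BCE description literally.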

b. Contrastive loss: Now we approach the problem in a slightly different way. Rather than directly classifying the pair as 0 or 1, we train the network to learn an embedding that pulls features of the same person closer together and pushes features of different people apart (e.g., the features representing all images of, say, "Brad Pitt" should be very close to each other, but far away from the feature clusters of all other people). We do this by having the network learn a similarity metric. It is reasonable to assume that the L2 distance between features can serve this purpose (as in K-Means), so we train the network to yield the embedding that best groups features by L2 distance. In the end, we train with what is called contrastive loss, a custom loss function defined as follows. Suppose we define:

$D_w(I_1, I_2) = \lVert C(I_1) - C(I_2) \rVert_2 = \lVert f_1 - f_2 \rVert_2$

(i.e., the Euclidean distance between the features produced by $C$ for the pair of input images). We then define our custom objective function $L$ as:

$L(I_1, I_2, y) = y \cdot D_w(I_1, I_2)^2 + (1 - y) \cdot \max(0, m - D_w(I_1, I_2))^2$

where $I_1$ and $I_2$ form the input image pair, $y$ is the ground truth label (1 if they are the same person and 0 if they are different), and $m$ is a margin term: pairs of different people are penalized only while their distance is below $m$, which nudges the network to push their features at least that far apart. You can start with a margin of 1.0, but feel free to use other margins. The network is trained to minimize this loss. Use the same base network as before. The only difference is that instead of concatenating features and doing binary classification, we are training to enforce structure on the data (think of it as supervised clustering). Use the same train/test split as before.
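
As a rough sketch, the loss $L$ can be written directly from the formula. The function name `contrastive_loss` is an assumption, and so is averaging over the batch: the formula itself is defined per pair and does not say how pairs are reduced.

```python
import torch


def contrastive_loss(f1: torch.Tensor, f2: torch.Tensor, y: torch.Tensor,
                     margin: float = 1.0) -> torch.Tensor:
    """Batch mean (an assumption) of y * D^2 + (1 - y) * max(0, margin - D)^2."""
    d = torch.norm(f1 - f2, p=2, dim=1)              # Euclidean distance D_w per pair
    same_term = y * d.pow(2)                         # pulls same-person pairs together
    diff_term = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # pushes others apart
    return (same_term + diff_term).mean()


# Two pairs whose features are both exactly 5 apart (a 3-4-5 triangle).
f1 = torch.tensor([[0.0, 0.0], [0.0, 0.0]])
f2 = torch.tensor([[3.0, 4.0], [3.0, 4.0]])
y = torch.tensor([1.0, 0.0])                         # first pair same, second different

# Same pair contributes 5^2 = 25; different pair is beyond the margin and contributes 0.
loss = contrastive_loss(f1, f2, y)                   # mean of (25, 0) = 12.5
```

The example shows the asymmetry in the objective: a same-person pair is always penalized by its squared distance, while a different-person pair stops contributing once its features are at least `margin` apart.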

## Where is the code?

As always, check out my GitHub:

https://github.com/stytim/LFW_Siamese_Pytorch