PyTorch Implementation: Siamese Network


Project Goal

This project uses the Labeled Faces in the Wild (LFW) dataset. The goal is to train variants of a deep architecture to decide whether two face images show the same person. It is a PyTorch implementation of a Siamese network built on a 19-layer base network.


Consider a Convolutional Neural Network (CNN), denoted by C, that takes a single image I as input and outputs a feature vector f ∈ ℝ^N, where f is simply the output of the final fully connected layer, which contains N nodes (and hence produces N numbers). In other words, C(I) = f ∈ ℝ^N.

A Siamese Network is a CNN that takes two separate image inputs, I1 and I2, and passes both through the exact same CNN C (this is what is called "shared weights"), so that C(I1) = f1 and C(I2) = f2. Note that the same C is used for both images, not two different networks. There is only one network, and this is key. We can process the images in one pass, and then "play games" with the output features to learn what it means for two face images, represented by their respective latent embeddings f1 and f2, to show the same identity or not. Here's the base architecture we will use throughout. Let's call this C:

Network Architecture

1. Convolution Layer

Input: image of size 128x128x3 (3 for RGB color channels)
Parameters: 64 features of size 5×5, stride of (1,1), padding=2
Output: tensor of size 128x128x64

2. ReLU (in-place)

3. Batch Normalization Layer

64 features

4. Max pooling layer

Input: tensor of size 128x128x64
Parameters: none, perform 2×2 max-pooling with (2,2) stride
Output: tensor of size 64x64x64

5. Convolution Layer

Input: 64x64x64
Parameters: 128 features of size 5×5, stride of (1,1), padding=2
Output: tensor of size 64x64x128

6. ReLU (in-place)

7. Batch Normalization Layer

128 features

8. Max pooling layer

Input: tensor of size 64x64x128
Parameters: none, perform 2×2 max-pooling with (2,2) stride
Output: tensor of size 32x32x128

9. Convolution Layer

Input: 32x32x128
Parameters: 256 features of size 3×3, stride of (1,1), padding=1
Output: tensor of size 32x32x256

10. ReLU (in-place)

11. Batch Normalization Layer

256 features

12. Max pooling layer

Input: tensor of size 32x32x256
Parameters: none, perform 2×2 max-pooling with (2,2) stride
Output: tensor of size 16x16x256

13. Convolution Layer

Input: 16x16x256
Parameters: 512 features of size 3×3, stride of (1,1), padding=1
Output: tensor of size 16x16x512

14. ReLU (in-place)

15. Batch Normalization Layer

512 features

16. Flatten Layer

Input: tensor of size 16x16x512
Parameters: none, simply flatten the tensor into 1-D
Output: vector of size 16x16x512=131072
Note: recent PyTorch versions provide this layer as nn.Flatten. In older versions you'll have to use view(), or implement it yourself.

17. Fully Connected Layer (aka Linear Layer)

Input: vector of size 131072
Parameters: fully-connected layer with 1024 nodes
Output: vector of size 1024

18. ReLU (in-place)

19. Batch Normalization Layer

1024 features

For this network C, a feature f is produced as the output of layer number 19, of size 1024.
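
The 19 layers above can be sketched in PyTorch roughly as follows. This is a minimal sketch and not necessarily the project's actual code: the names conv_block and BaseCNN and the image_size and feat_dim arguments are hypothetical, introduced only for illustration. It also assumes PyTorch's channels-first layout, so the 128x128x3 input is passed as a tensor of shape (batch, 3, 128, 128).

```python
import torch
import torch.nn as nn


def conv_block(in_ch, out_ch, kernel, pool=True):
    """Conv -> ReLU (in-place) -> BatchNorm, optionally followed by 2x2 max-pooling.

    Hypothetical helper; padding = kernel // 2 gives padding 2 for 5x5
    and padding 1 for 3x3, so the spatial size is preserved by the conv.
    """
    layers = [
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel, stride=1, padding=kernel // 2),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
    ]
    if pool:
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return layers


class BaseCNN(nn.Module):
    """Sketch of the base network C (layers 1-19)."""

    def __init__(self, image_size=128, feat_dim=1024):
        super().__init__()
        self.features = nn.Sequential(
            *conv_block(3, 64, 5),                  # layers 1-4
            *conv_block(64, 128, 5),                # layers 5-8
            *conv_block(128, 256, 3),               # layers 9-12
            *conv_block(256, 512, 3, pool=False),   # layers 13-15 (no pooling)
        )
        # Three 2x2 poolings halve the side length three times: 128 -> 16.
        side = image_size // 8
        self.head = nn.Sequential(
            nn.Flatten(),                            # layer 16
            nn.Linear(512 * side * side, feat_dim),  # layer 17
            nn.ReLU(inplace=True),                   # layer 18
            nn.BatchNorm1d(feat_dim),                # layer 19
        )

    def forward(self, x):
        # x: (batch, 3, image_size, image_size) -> (batch, feat_dim)
        return self.head(self.features(x))
```

With the default arguments, calling BaseCNN() on a batch of shape (B, 3, 128, 128) yields features of shape (B, 1024). The fully connected layer alone holds 131072 x 1024 weights, so the default model is memory-hungry; a smaller image_size is handy for quick shape checks.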

Loss Function

a. Binary classification with BCE: The train and test splits are each provided as a text file. Each row contains the file paths of two images, followed by a 1 if they show the same person and a 0 if not. I will read in these images and train the network as follows:

Given the network topology described above, construct a Siamese network that feeds a pair of images into this shared network. Each image of the pair is run through C individually, producing f1 and f2. Combine f1 and f2 through concatenation to produce f12 (i.e., append the elements of f2 to the end of f1, so that f12 has dimension 2*1024=2048). Then pass f12 through one more fully connected layer to classify the pair as 0 or 1. This output layer has 1 node (i.e., only one number is output) and uses a sigmoid activation, which squashes the output into the range (0.0, 1.0). The output of this last layer then represents the probability that the two images show the same person (it will be close to 1 if they do and close to 0 if not).
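
A minimal sketch of this classification head is shown below, under some loud assumptions. The class name SiameseClassifier is hypothetical, and to keep the example small and self-contained it uses a tiny stand-in embedding network with 16 features rather than the real 1024-dimensional C; any module that maps an image batch to a (batch, feat_dim) tensor could be plugged in instead.

```python
import torch
import torch.nn as nn


class SiameseClassifier(nn.Module):
    """Shared embedding network + concatenation + 1-node sigmoid output."""

    def __init__(self, embed_net, feat_dim=1024):
        super().__init__()
        self.embed_net = embed_net                      # the single shared network C
        self.classifier = nn.Linear(2 * feat_dim, 1)    # f12 -> one number

    def forward(self, img1, img2):
        f1 = self.embed_net(img1)                       # same weights for both images
        f2 = self.embed_net(img2)
        f12 = torch.cat([f1, f2], dim=1)                # (batch, 2 * feat_dim)
        return torch.sigmoid(self.classifier(f12)).squeeze(1)  # (batch,)


# Illustration only: a tiny stand-in for C (NOT the 19-layer network).
torch.manual_seed(0)
stand_in = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16))
model = SiameseClassifier(stand_in, feat_dim=16)

img1 = torch.randn(4, 3, 8, 8)
img2 = torch.randn(4, 3, 8, 8)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])             # 1 = same person, 0 = different

probs = model(img1, img2)                               # probabilities of "same person"
loss = nn.BCELoss()(probs, labels)                      # binary cross-entropy
loss.backward()                                         # gradients reach the shared network
```

Because both images go through the same embed_net object, there is only one set of embedding weights to train. A common variant drops the explicit sigmoid and uses nn.BCEWithLogitsLoss on the raw output, which is numerically more stable; the version here follows the description above literally.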

b. Contrastive loss: Now we will approach the problem in a slightly different way. Rather than attempt to directly classify the pair as 0 or 1, we will train the network to learn an embedding that pulls features of the same person closer together and pushes features of different identities apart (e.g., the features representing all images of, say, "Brad Pitt" should be very close to each other, but far away from the feature clusters of all other people). The way we will do this is to have the network learn a similarity metric. It is reasonable to use the L2 distance between features as that metric (as in K-Means), so we will train the network to produce the embedding that best groups features under L2 distance. To do so, we train with what is called contrastive loss, a custom loss function defined as follows. Suppose we define $D_w(I_1, I_2) = \lVert C(I_1) - C(I_2) \rVert_2 = \lVert f_1 - f_2 \rVert_2$ (i.e., the Euclidean distance between the features produced by C for the pair of input images). We then define our custom objective function L as:

$$L(I_1, I_2, y) = y \, D_w(I_1, I_2)^2 + (1 - y) \, \bigl[\max\bigl(0,\; m - D_w(I_1, I_2)\bigr)\bigr]^2$$

where I1 and I2 are the two images of the input pair, y is the ground truth label (1 if they are the same person and 0 if they are different), and m is a margin term that nudges the network to separate features even more: a pair of different people contributes to the loss only while its features are closer than m. You can start with a margin of 1.0, but feel free to use other margins. The network is trained to minimize this term. Use the same base network as before. The only difference here is that instead of concatenating features and doing binary classification, we are training to enforce structure on the data (think of it as supervised clustering). Use the same train/test split as before.
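
The loss above can be sketched as a small PyTorch function. This is an illustrative sketch, not necessarily the project's code: the function name contrastive_loss is hypothetical, and averaging over the batch is an assumption, since the formula is stated for a single pair. Note that F.pairwise_distance adds a small epsilon before taking the norm, so the distance between identical features is tiny but not exactly zero.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(f1, f2, y, margin=1.0):
    """Contrastive loss for a batch of feature pairs.

    f1, f2: (batch, N) feature tensors produced by the shared network C.
    y:      (batch,) float labels, 1 = same person, 0 = different.
    """
    d = F.pairwise_distance(f1, f2)                          # Euclidean distance D_w, shape (batch,)
    same_term = y * d.pow(2)                                 # pulls same-person pairs together
    diff_term = (1 - y) * torch.clamp(margin - d, min=0).pow(2)  # pushes others apart, up to the margin
    return (same_term + diff_term).mean()                    # batch average (assumption)


# Two toy pairs: identical features, and features at distance 5.
f1 = torch.zeros(2, 4)
f2 = torch.tensor([[0.0, 0.0, 0.0, 0.0],
                   [3.0, 4.0, 0.0, 0.0]])

loss_same = contrastive_loss(f1, f2, torch.tensor([1.0, 1.0]))  # penalizes the distant pair
loss_diff = contrastive_loss(f1, f2, torch.tensor([0.0, 0.0]))  # penalizes the identical pair
```

When both toy pairs are labeled "same", only the pair at distance 5 is penalized; when both are labeled "different", only the identical pair is penalized, because the distant pair already lies beyond the margin.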

Where is the code?

As always, check out my GitHub: