How to Initialize Model Weights in PyTorch

In the world of deep learning, the process of initializing model weights plays a crucial role in determining the success of a neural network’s training. PyTorch, a popular open-source deep learning library, offers various techniques for weight initialization, which can significantly impact the model’s learning efficiency and convergence speed.

A well-initialized model can lead to faster convergence, improved generalization, and a more stable training process. In this article, we’ll explore different weight initialization techniques available in PyTorch, discuss their benefits and drawbacks, and provide a step-by-step guide on how to implement them in your deep learning project.

Importance of weight initialization in Deep Learning

Initializing model weights is important in deep learning. It affects how gradients propagate during training and the range of outputs each layer produces. PyTorch provides numerous strategies for weight initialization, including simple methods such as drawing samples from uniform and normal distributions, as well as more sophisticated approaches such as Xavier (Glorot) initialization and Kaiming initialization. Xavier initialization is typically used for layers with Sigmoid and Tanh activation functions, while Kaiming initialization is tailored for layers with ReLU activation functions. Incorporating these weight initialization techniques into your PyTorch model can lead to more stable training and better model performance.

The goal of training any deep learning model is finding the optimum set of weights for the model that gives us the desired results. The training methods used in Deep Learning are generally iterative in nature and require us to provide an initial set of weights that needs to be updated over time.

The initial weights play a huge role in deciding the final outcome of the training. Incorrect initialization of weights can lead to vanishing or exploding gradients, either of which can stall or destabilize training. So we use some standard methods of initializing the layers, which we will be discussing in this article.

The general rule of thumb

A rule of thumb is that the initial model weights need to be “close to zero, but not zero”. A naive idea would be to sample from a distribution whose values lie arbitrarily close to 0.

For example, you can choose to fill the weight with values sampled from U(-0.01, 0.01) or N(0, 0.01).

It turns out that this idea is quite effective: most of the standard methods are based on sampling from a uniform or a normal distribution.

But the real trick lies in setting the boundary conditions for these distributions. One of the generally used boundary conditions is 1/sqrt(n), where n is the number of inputs to the layer.
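To see why the scale matters, here is a minimal sketch that pushes a random batch through a stack of purely linear maps (no activation function) and reports the standard deviation of the final output for three choices of weight standard deviation. The width, depth, batch size, and the helper name final_std are illustrative assumptions, not part of the PyTorch API.

```python
import math
import torch

def final_std(weight_std, width=256, depth=10, seed=0):
    """Hypothetical helper: std of the output after `depth` random linear maps."""
    gen = torch.Generator().manual_seed(seed)
    x = torch.randn(512, width, generator=gen)
    for _ in range(depth):
        # Weights drawn from N(0, weight_std^2)
        w = torch.randn(width, width, generator=gen) * weight_std
        x = x @ w
    return x.std().item()

n = 256  # number of inputs to each layer
too_small = final_std(0.01)
too_large = final_std(1.0)
scaled = final_std(1 / math.sqrt(n))

print(f"std = 0.01      -> output std {too_small:.3e}")
print(f"std = 1.0       -> output std {too_large:.3e}")
print(f"std = 1/sqrt(n) -> output std {scaled:.3e}")
```

With a fixed small standard deviation the signal shrinks toward zero, with a large one it blows up, and with 1/sqrt(n) it should stay at roughly the scale of the input.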

In PyTorch, we can set the weights of the layer to be sampled from uniform or normal distribution using the uniform_ and normal_ functions. Here is a simple example of uniform_() and normal_() in action.

from math import sqrt

import torch.nn as nn

# Linear Dense Layer
layer_1 = nn.Linear(5, 2)
print("Initial Weight of layer 1:")
print(layer_1.weight)

# Initialization with uniform distribution
nn.init.uniform_(layer_1.weight, -1/sqrt(5), 1/sqrt(5))
print("\nWeight after sampling from Uniform Distribution:\n")
print(layer_1.weight)

# Initialization with normal distribution
nn.init.normal_(layer_1.weight, 0, 1/sqrt(5))
print("\nWeight after sampling from Normal Distribution:\n")
print(layer_1.weight)

Output:

Initial Weight of layer 1:
Parameter containing:
tensor([[-0.0871, -0.0804,  0.2327, -0.1453, -0.1019],
        [-0.1338, -0.2465,  0.3257, -0.2669, -0.1537]], requires_grad=True)

Weight after sampling from Uniform Distribution:

Parameter containing:
tensor([[ 0.4370, -0.4110,  0.2631, -0.3564,  0.0707],
        [-0.0009,  0.3716, -0.3596,  0.3667,  0.2465]], requires_grad=True)

Weight after sampling from Normal Distribution:

Parameter containing:
tensor([[-0.2148,  0.1156,  0.7121,  0.2840, -0.4302],
        [-0.2647,  0.2148, -0.0852, -0.3813,  0.6983]], requires_grad=True)

But there are also some limitations to this method. These methods are a bit too generalized and tend to be a little problematic for layers having non-linear activation functions such as Sigmoid, Tanh and ReLU activations, where there is a high chance of vanishing and exploding gradients.
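The following sketch illustrates the limitation for ReLU, under the assumption of a plain stack of square layers whose weights are drawn from N(0, 1/n). ReLU zeroes out the negative half of each pre-activation, so the mean squared activation is roughly halved at every layer. The width, depth, and batch size are arbitrary choices made for illustration.

```python
import math
import torch

torch.manual_seed(0)
width, depth = 256, 20
x = torch.randn(512, width)
start = x.pow(2).mean().sqrt().item()  # root-mean-square of the input

for _ in range(depth):
    # Naive initialization: std = 1/sqrt(n), with no correction for ReLU
    w = torch.randn(width, width) / math.sqrt(width)
    x = torch.relu(x @ w)

end = x.pow(2).mean().sqrt().item()  # root-mean-square after the last layer
print(f"RMS activation: {start:.3f} at the input, {end:.3e} after {depth} ReLU layers")
```

The activations should shrink by several orders of magnitude over the 20 layers, and the gradients flowing back through such a stack shrink in a similar way.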

So in the next section we explore some of the advanced methods that have been proposed to tackle this problem.

Initialization of layers with non-linear activation

There are two standard methods for weight initialization of layers with non-linear activation: the Xavier (Glorot) initialization and the Kaiming initialization.

We will not dive into the mathematical expression and proofs but focus more on where to use them and how to apply them. However, understanding the mathematical background is beneficial.

Xavier Initialization

Xavier initialization is typically used for layers with Sigmoid and Tanh activation functions. There are two versions of Xavier initialization, which differ in the distribution the values are sampled from: the uniform distribution and the normal distribution. Here is a brief overview of the two variations:

1. Xavier Uniform Distribution

In this method the weight tensor is filled with values sampled from the uniform distribution U(-a, a), where

a = gain × sqrt(6 / (input_dim + output_dim))

input_dim and output_dim are the input and output dimensions of the layer, or more explicitly the sizes of the previous and the following layer, and gain is simply a scaling factor.

Example:

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2,2))

# Initializing with Xavier Uniform
nn.init.xavier_uniform_(conv_layer.weight)

2. Xavier Normal Distribution

This method is similar to the previous one, except that the values are sampled from the normal distribution N(0, std²), where

std = gain × sqrt(2 / (input_dim + output_dim))

and input_dim and output_dim are the input and output dimensions of the layer, or more explicitly the sizes of the previous and the following layer.

Example:

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2,2))

# Initializing with Xavier Normal
nn.init.xavier_normal_(conv_layer.weight)
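As a sanity check, the Xavier bound and standard deviation can be computed by hand and compared with what PyTorch produces. This is a minimal sketch: it assumes the default gain of 1, and it relies on the fact that for a convolution layer PyTorch multiplies the channel counts by the kernel area when computing the input and output dimensions. The larger nn.Linear layer is only there to provide enough samples for a meaningful empirical standard deviation.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
conv_layer = nn.Conv2d(1, 4, (2, 2))  # weight shape: (4, 1, 2, 2)

# For a conv layer both dimensions include the kernel area.
kernel_area = 2 * 2
input_dim = 1 * kernel_area   # in_channels * kernel area
output_dim = 4 * kernel_area  # out_channels * kernel area

gain = 1.0  # default gain of xavier_uniform_ and xavier_normal_
bound = gain * math.sqrt(6 / (input_dim + output_dim))

nn.init.xavier_uniform_(conv_layer.weight)
max_abs = conv_layer.weight.abs().max().item()
print(f"expected bound a = {bound:.4f}, largest |w| = {max_abs:.4f}")

# Xavier normal on a bigger layer, so the empirical std is reliable.
big_layer = nn.Linear(400, 600)
nn.init.xavier_normal_(big_layer.weight)
expected_std = gain * math.sqrt(2 / (400 + 600))
empirical_std = big_layer.weight.std().item()
print(f"expected std = {expected_std:.4f}, empirical std = {empirical_std:.4f}")

# calculate_gain returns the recommended gain for a nonlinearity such as tanh;
# it can be passed as the `gain` argument of the Xavier functions.
tanh_gain = nn.init.calculate_gain('tanh')
print(f"recommended gain for tanh = {tanh_gain:.4f}")
```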

Kaiming Initialization

So far we have discussed how to initialize weights when the layer has a Sigmoid or Tanh activation function. We have not yet discussed ReLU.

Layers with ReLU activation functions were once initialized using the Xavier method, until Kaiming He and colleagues proposed a method tailored to ReLU activations. Kaiming initialization differs from Xavier initialization only in the mathematical formula for the boundary conditions.

The PyTorch implementation of Kaiming initialization handles not only ReLU but also LeakyReLU. PyTorch offers two modes for Kaiming initialization: fan_in and fan_out. The fan_in mode preserves the scale of the activations in the forward pass, so the signal neither explodes nor vanishes. Similarly, the fan_out mode preserves the scale of the gradients during backpropagation.

1. Kaiming Uniform Distribution

The weight tensor is filled with values sampled from the uniform distribution U(-a, a), where

a = gain × sqrt(3 / fan)

Here fan is the input dimension in fan_in mode and the output dimension in fan_out mode. The gain for ReLU is √2, and for LeakyReLU it is √(2 / (1 + negative_slope²)), where negative_slope is the slope of the negative part of the LeakyReLU.

The gain is usually taken care of by the kaiming_uniform_() and kaiming_normal_() functions, where we need to specify only the type of non-linearity we are dealing with.

Example:

conv_layer = nn.Conv2d(1, 4, (2,2))

nn.init.kaiming_uniform_(conv_layer.weight, mode='fan_in', nonlinearity='relu')
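The gain and the resulting bound can also be checked by hand. The sketch below uses nn.init.calculate_gain to obtain the gain for ReLU and for LeakyReLU, then compares the hand-computed bound for fan_in mode with the largest weight that kaiming_uniform_ produces. The negative slope of 0.2 is an arbitrary value chosen for illustration.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
conv_layer = nn.Conv2d(1, 4, (2, 2))

# calculate_gain returns the recommended gain for a given nonlinearity.
relu_gain = nn.init.calculate_gain('relu')              # sqrt(2)
leaky_gain = nn.init.calculate_gain('leaky_relu', 0.2)  # sqrt(2 / (1 + 0.2**2))

# fan_in of a conv layer: in_channels * kernel_height * kernel_width
fan_in = 1 * 2 * 2
bound = relu_gain * math.sqrt(3 / fan_in)

nn.init.kaiming_uniform_(conv_layer.weight, mode='fan_in', nonlinearity='relu')
max_abs = conv_layer.weight.abs().max().item()

print(f"relu gain = {relu_gain:.4f}, leaky_relu(0.2) gain = {leaky_gain:.4f}")
print(f"expected bound a = {bound:.4f}, largest |w| = {max_abs:.4f}")
```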

2. Kaiming Normal Distribution

The layer weights are sampled from the normal distribution N(0, std²), where

std = gain / sqrt(fan)

and fan is the layer's input dimension (input_dim) in fan_in mode and its output dimension (output_dim) in fan_out mode.

Example:

conv_layer = nn.Conv2d(1, 4, (2,2))

nn.init.kaiming_normal_(conv_layer.weight, mode='fan_in', nonlinearity='relu')
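The effect of the mode is easiest to see on a layer whose input and output sizes differ. The sketch below is an illustration on an assumed nn.Linear(400, 120) layer: it initializes the same weight tensor in both modes and compares the empirical standard deviation with gain / sqrt(fan), using the ReLU gain of √2.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(400, 120)  # input dimension 400, output dimension 120

# fan_in mode: the standard deviation is based on the input dimension.
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
std_fan_in = layer.weight.std().item()

# fan_out mode: the standard deviation is based on the output dimension.
nn.init.kaiming_normal_(layer.weight, mode='fan_out', nonlinearity='relu')
std_fan_out = layer.weight.std().item()

expected_fan_in = math.sqrt(2) / math.sqrt(400)
expected_fan_out = math.sqrt(2) / math.sqrt(120)
print(f"fan_in : expected {expected_fan_in:.4f}, got {std_fan_in:.4f}")
print(f"fan_out: expected {expected_fan_out:.4f}, got {std_fan_out:.4f}")
```

Because this layer has fewer outputs than inputs, fan_out mode yields the larger weights.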

Integrating Weight Initialization Rules in Your PyTorch Model

Now that we are familiar with how we can initialize single layers using PyTorch, we can try to initialize layers of real-life PyTorch models. We can do this initialization in the model definition or apply these methods after the model has been defined.

1. Initializing when the model is defined

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Layer definitions
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

        # Initialization: Kaiming for the layers followed by ReLU,
        # Xavier for the last layer, which is followed by a sigmoid
        nn.init.kaiming_normal_(self.fc1.weight, mode='fan_in',
                                nonlinearity='relu')
        nn.init.kaiming_normal_(self.fc2.weight, mode='fan_in',
                                nonlinearity='relu')
        nn.init.xavier_normal_(self.fc3.weight)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        x = torch.sigmoid(x)
        return x

# Every time you create a new model, its weights are initialized as defined above
net = Net()

2. Initializing after the model is created

You can always alter the weights after the model is created. You can do this by defining a rule for a particular type of layer and applying it to the whole model, or by initializing a single layer directly.

# Defining a method for initialization of linear weights
# The initialization will be applied to all linear layers
# irrespective of their activation function

def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)

# Applying it to our net
net.apply(init_weights)

# Create the model
net = Net()

# Apply the Xavier normal method to the last layer
nn.init.xavier_normal_(net.fc3.weight)
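The same apply pattern extends to rules that treat each layer type differently. The following is a minimal, self-contained sketch: the small nn.Sequential model and the function name init_by_type are hypothetical, and zeroing the biases is an additional assumption made for illustration rather than something the sections above prescribe.

```python
import torch
import torch.nn as nn

def init_by_type(m):
    """Hypothetical rule: Kaiming for conv layers, Xavier for linear layers."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
    elif isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
    # Assumption for this sketch: start every handled layer with a zero bias.
    if isinstance(m, (nn.Conv2d, nn.Linear)) and m.bias is not None:
        nn.init.zeros_(m.bias)

# A small stand-in model; apply() visits every submodule recursively.
model = nn.Sequential(
    nn.Conv2d(3, 6, 5),          # 32x32 input -> 28x28 feature maps
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6 * 28 * 28, 10),
)
model.apply(init_by_type)

out = model(torch.randn(2, 3, 32, 32))
print(out.shape)  # torch.Size([2, 10])
```

Modules without a matching rule, such as nn.ReLU and nn.Flatten here, are simply left untouched.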

Conclusion and Further Exploration

Now that you know why weight initialization matters and which methods PyTorch offers, consider what other techniques could make neural network training more effective. Keep investigating and trying out new approaches!
