Knowing how to initialize model weights is an important topic in Deep Learning. The initial weights impact a lot of factors – the gradients, the output subspace, etc. In this article, we will learn about some of the most important and widely used weight initialization techniques and how to implement them using PyTorch. This article expects the user to have beginner-level familiarity with PyTorch.
Why is it important to initialize model weights?
The goal of training any deep learning model is finding the optimum set of weights for the model that gives us the desired results. The training methods used in Deep Learning are generally iterative in nature and require us to provide an initial set of weights that needs to be updated over time.
The initial weights play a huge role in deciding the final outcome of the training. Wrong initialization of weights can lead to vanishing or exploding gradients, which is obviously unwanted. So we use some standard methods of initializing the layers, which we will be discussing in this article.
The general rule of thumb
A rule of thumb is that the “initial model weights need to be close to zero, but not zero”. A naive idea would be to sample from a Distribution that is arbitrarily close to 0.
For example, you can choose to fill the weight with values sampled from U(-0.01, 0.01) or N(0, 0.01).
It turns out the above idea is not so naive after all: most of the standard methods are based on sampling from a Uniform or Normal distribution.
But the real trick lies in setting the boundary conditions for these distributions. One of the generally used boundary conditions is 1/sqrt(n), where n is the number of inputs to the layer.
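To see why the 1/sqrt(n) boundary matters, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). It pushes a random vector through a stack of purely linear layers and reports the root-mean-square (RMS) of the result. The helper name `final_rms`, the width of 32 and the depth of 20 are assumptions chosen only for illustration.

```python
import math
import random


def final_rms(weight_std, width=32, depth=20, seed=0):
    """Push a random vector through `depth` linear layers and return its RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output is a weighted sum of `width` inputs,
        # with weights drawn from N(0, weight_std^2).
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)


n = 32  # number of inputs to every layer in this toy stack
too_small = final_rms(0.01)              # signal shrinks at every layer
balanced = final_rms(1 / math.sqrt(n))   # signal keeps roughly the same scale
too_large = final_rms(1.0)               # signal grows at every layer

print(f"std = 0.01      -> RMS {too_small:.3e}")
print(f"std = 1/sqrt(n) -> RMS {balanced:.3e}")
print(f"std = 1.0       -> RMS {too_large:.3e}")
```

With a standard deviation of 1/sqrt(n), each layer roughly preserves the scale of its input, while the other two choices multiply the scale by a fixed factor at every layer, which is what leads to vanishing or exploding signals in deep networks.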
In PyTorch, we can set the weights of a layer to be sampled from a uniform or normal distribution using the uniform_ and normal_ functions. Here is a simple example of uniform_() and normal_() in action.
```python
from math import sqrt

import torch.nn as nn

# Linear Dense Layer
layer_1 = nn.Linear(5, 2)
print("Initial Weight of layer 1:")
print(layer_1.weight)

# Initialization with uniform distribution
nn.init.uniform_(layer_1.weight, -1/sqrt(5), 1/sqrt(5))
print("\nWeight after sampling from Uniform Distribution:\n")
print(layer_1.weight)

# Initialization with normal distribution
nn.init.normal_(layer_1.weight, 0, 1/sqrt(5))
print("\nWeight after sampling from Normal Distribution:\n")
print(layer_1.weight)
```
```
Initial Weight of layer 1:
Parameter containing:
tensor([[-0.0871, -0.0804,  0.2327, -0.1453, -0.1019],
        [-0.1338, -0.2465,  0.3257, -0.2669, -0.1537]], requires_grad=True)

Weight after sampling from Uniform Distribution:

Parameter containing:
tensor([[ 0.4370, -0.4110,  0.2631, -0.3564,  0.0707],
        [-0.0009,  0.3716, -0.3596,  0.3667,  0.2465]], requires_grad=True)

Weight after sampling from Normal Distribution:

Parameter containing:
tensor([[-0.2148,  0.1156,  0.7121,  0.2840, -0.4302],
        [-0.2647,  0.2148, -0.0852, -0.3813,  0.6983]], requires_grad=True)
```
But there are also some limitations to this method. These methods are a bit too generalized and tend to be problematic for layers with non-linear activation functions such as ReLU, where there is a high chance of vanishing or exploding gradients.
So in the next section we explore some of the advanced methods that have been proposed to tackle this problem.
Initialization of layers with non-linear activation
There are two standard methods for weight initialization of layers with non-linear activation: the Xavier (Glorot) initialization and the Kaiming initialization.
We will not dive into the mathematical expression and proofs but focus more on where to use them and how to apply them. This is absolutely not an invitation to skip the mathematical background.
1. Xavier Initialization
Xavier initialization is used for layers having Tanh activation functions. There are two different versions of Xavier initialization. The difference lies in the distribution from which we sample the values: the Uniform distribution or the Normal distribution. Here is a brief overview of the two variations:
2. Xavier Uniform Distribution
In this method the weight tensor is filled with values sampled from the Uniform distribution U(-a, a) where
a = gain × √(6 / (input_dim + output_dim))
Here input_dim and output_dim are the input and output dimensions of the layer (its fan-in and fan-out), and gain is simply a scaling factor.
```python
import torch.nn as nn

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2, 2))

# Initializing with Xavier Uniform
nn.init.xavier_uniform_(conv_layer.weight)
```
3. Xavier Normal Distribution
This method is similar to the previous one, except that the values are sampled from the Normal distribution N(0, std^2) where
std = gain × √(2 / (input_dim + output_dim))
As before, input_dim and output_dim are the input and output dimensions of the layer (its fan-in and fan-out).
```python
import torch.nn as nn

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2, 2))

# Initializing with Xavier Normal
nn.init.xavier_normal_(conv_layer.weight)
```
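As a rough check of the numbers behind the two Xavier variants, the sketch below computes the values by hand for the Conv2d(1, 4, (2, 2)) layer used above. It follows the formulas documented for torch.nn.init: a bound of gain × √(6 / (fan_in + fan_out)) for the uniform variant and a standard deviation of gain × √(2 / (fan_in + fan_out)) for the normal variant, where the fans of a convolution are the channel counts multiplied by the kernel area. The helper names are hypothetical and are not part of PyTorch.

```python
import math


def conv2d_fans(in_channels, out_channels, kernel_size):
    """Return (fan_in, fan_out) for a 2D convolution weight tensor."""
    receptive_field = kernel_size[0] * kernel_size[1]
    return in_channels * receptive_field, out_channels * receptive_field


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound `a` of U(-a, a) for Xavier uniform initialization."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_normal_std(fan_in, fan_out, gain=1.0):
    """Standard deviation for Xavier normal initialization."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


# Same shape as nn.Conv2d(1, 4, (2, 2)) in the examples above
fan_in, fan_out = conv2d_fans(1, 4, (2, 2))
bound = xavier_uniform_bound(fan_in, fan_out)
std = xavier_normal_std(fan_in, fan_out)

print(f"fan_in = {fan_in}, fan_out = {fan_out}")
print(f"Xavier uniform bound: {bound:.4f}")
print(f"Xavier normal std:    {std:.4f}")
```

The two variants are matched in variance: a uniform distribution U(-a, a) has a standard deviation of a / √3, which equals the standard deviation used by the normal variant.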
So far we have discussed how to initialize weights when the layer has a Tanh activation function. We have not yet discussed ReLU. Layers with ReLU activation functions were once initialized using the Xavier method, until Kaiming He proposed his method for initializing layers with ReLU activation functions. Kaiming initialization differs from Xavier initialization only in the mathematical formula for the boundary conditions.
The PyTorch implementation of Kaiming initialization deals not only with ReLU but also with LeakyReLU. PyTorch offers two different modes for Kaiming initialization: the fan_in mode and the fan_out mode. Using the fan_in mode preserves the scale of the data in the forward pass, so that it neither explodes nor vanishes. Similarly, the fan_out mode tries to preserve the scale of the gradients in back-propagation.
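The difference between the two schemes on ReLU layers can be illustrated with a small simulation in plain Python. This is only a sketch under simplifying assumptions: every layer has the same width (so fan_in equals fan_out), biases are omitted, the Xavier gain is 1, and the helper name `relu_stack_rms` is made up for this example.

```python
import math
import random


def relu_stack_rms(std_for_width, width=32, depth=20, seed=0):
    """Push a random vector through `depth` ReLU layers and return its RMS."""
    rng = random.Random(seed)
    std = std_for_width(width)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Linear layer with N(0, std^2) weights, followed by ReLU
        x = [max(0.0, sum(rng.gauss(0.0, std) * xi for xi in x)) for _ in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)


# Xavier normal with fan_in = fan_out = n and gain 1: std = sqrt(2 / (n + n))
xavier_rms = relu_stack_rms(lambda n: math.sqrt(2.0 / (n + n)))
# Kaiming normal in fan_in mode with the ReLU gain sqrt(2): std = sqrt(2 / n)
kaiming_rms = relu_stack_rms(lambda n: math.sqrt(2.0 / n))

print(f"Xavier  std -> RMS after 20 ReLU layers: {xavier_rms:.3e}")
print(f"Kaiming std -> RMS after 20 ReLU layers: {kaiming_rms:.3e}")
print(f"ratio: {kaiming_rms / xavier_rms:.1f}")
```

Because both runs use the same seed, the only difference is the extra factor of √2 per layer in the Kaiming weights, which compounds to about 2^10 over 20 layers. That factor compensates for ReLU zeroing out roughly half of the activations at each layer.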
1. Kaiming Uniform distribution
The weight tensor is filled with values sampled from the Uniform distribution U(-a, a) where
a = gain × √(3 / fan)
For fan_in mode, fan is the input dimension, whereas for fan_out mode it is the output dimension. The gain for ReLU is √2 and for LeakyReLU it is √(2 / (1 + negative_slope^2)), where negative_slope is the slope of the LeakyReLU for negative inputs.
The gain is usually taken care of by the kaiming_uniform_() and kaiming_normal_() functions, where we need to specify only the type of non-linearity we are dealing with.
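For reference, the gain values can be reproduced by hand. The sketch below uses the formulas documented for torch.nn.init.calculate_gain, namely √2 for ReLU and √(2 / (1 + negative_slope^2)) for LeakyReLU. The function `relu_family_gain` is a hypothetical stand-in written for this article, not a PyTorch function.

```python
import math


def relu_family_gain(nonlinearity, negative_slope=0.01):
    """Gain for ReLU-like activations, following the documented formulas."""
    if nonlinearity == "relu":
        return math.sqrt(2.0)
    if nonlinearity == "leaky_relu":
        return math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    raise ValueError(f"unsupported nonlinearity: {nonlinearity}")


print(f"relu:                  {relu_family_gain('relu'):.4f}")
print(f"leaky_relu (slope 0.2): {relu_family_gain('leaky_relu', 0.2):.4f}")
```

A negative slope of 0 turns LeakyReLU into plain ReLU, and the gain becomes √2 again; a slope of 1 makes the activation linear, and the gain drops to 1.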
```python
import torch.nn as nn

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2, 2))

# Initializing with Kaiming Uniform
nn.init.kaiming_uniform_(conv_layer.weight, mode='fan_in', nonlinearity='relu')
```
2. Kaiming Normal Distribution
The layer weights are sampled from the Normal distribution N(0, std^2) where
std = gain / √fan
and fan is either the input dimension or the output dimension of the layer, depending on the chosen mode.
```python
import torch.nn as nn

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2, 2))

# Initializing with Kaiming Normal
nn.init.kaiming_normal_(conv_layer.weight, mode='fan_in', nonlinearity='relu')
```
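To make the effect of the mode concrete, the following sketch computes the Kaiming bound and standard deviation by hand for the Conv2d(1, 4, (2, 2)) layer used above. It assumes the formulas documented for torch.nn.init (a bound of gain × √(3 / fan) for the uniform variant and a standard deviation of gain / √fan for the normal variant) together with the ReLU gain of √2. The helper names are hypothetical.

```python
import math

RELU_GAIN = math.sqrt(2.0)


def kaiming_uniform_bound(fan, gain=RELU_GAIN):
    """Bound `a` of U(-a, a) for Kaiming uniform initialization."""
    return gain * math.sqrt(3.0 / fan)


def kaiming_normal_std(fan, gain=RELU_GAIN):
    """Standard deviation for Kaiming normal initialization."""
    return gain / math.sqrt(fan)


# nn.Conv2d(1, 4, (2, 2)) has a weight of shape (4, 1, 2, 2)
fans = {
    "fan_in": 1 * 2 * 2,   # input channels * kernel area
    "fan_out": 4 * 2 * 2,  # output channels * kernel area
}

for mode, fan in fans.items():
    print(f"{mode}: fan = {fan}, "
          f"uniform bound = {kaiming_uniform_bound(fan):.4f}, "
          f"normal std = {kaiming_normal_std(fan):.4f}")
```

For this layer fan_out is four times fan_in, so switching the mode from fan_in to fan_out halves both the bound and the standard deviation.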
Integrating the initializing rules in your PyTorch Model
Now that we are familiar with how we can initialize single layers using PyTorch, we can try to initialize layers of real-life PyTorch models. We can do this initialization in the model definition or apply these methods after the model has been defined.
1. Initializing when the model is defined
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        # Layer definitions
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

        # Initialization
        nn.init.kaiming_normal_(self.fc1.weight, mode='fan_in', nonlinearity='relu')
        nn.init.kaiming_normal_(self.fc2.weight, mode='fan_in', nonlinearity='relu')
        nn.init.xavier_normal_(self.fc3.weight)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        # nn.sigmoid does not exist; torch.sigmoid is the functional form
        x = torch.sigmoid(x)
        return x


# Every new instance of Net is created with the initialization above
net = Net()
```
2. Initializing after the model is created
You can always alter the weights after the model is created. You can do this by defining a rule for a particular type of layer and applying it to the whole model, or just by initializing a single layer.
```python
# Defining a method for initialization of linear weights
# The initialization will be applied to all linear layers
# irrespective of their activation function
def init_weights(m):
    if isinstance(m, nn.Linear):
        # xavier_uniform_ is the in-place variant; the version
        # without the trailing underscore is deprecated
        nn.init.xavier_uniform_(m.weight)


# Applying it to our net
net.apply(init_weights)
```
```python
# Create the model
net = Net()

# Apply the Xavier normal method to the last layer
# (outside the class, the layer is accessed through the instance)
nn.init.xavier_normal_(net.fc3.weight)
```
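The rule-based approach can also be extended to choose a different method per layer type. The sketch below is one possible arrangement, not a recommendation: the function name `init_by_layer_type`, the small `nn.Sequential` model, and the choice to zero the biases are all assumptions made for illustration.

```python
import math

import torch
import torch.nn as nn


def init_by_layer_type(module):
    """Pick an initialization based on the type of each submodule."""
    if isinstance(module, nn.Conv2d):
        # Convolutions here are followed by ReLU, so use Kaiming
        nn.init.kaiming_normal_(module.weight, mode="fan_in", nonlinearity="relu")
    elif isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
    # Assumption for this sketch: start all biases at zero
    if isinstance(module, (nn.Conv2d, nn.Linear)) and module.bias is not None:
        nn.init.zeros_(module.bias)


torch.manual_seed(0)  # make the sampled weights reproducible

# A small stand-in model for 3x32x32 inputs
model = nn.Sequential(
    nn.Conv2d(3, 6, 5),           # output: 6 x 28 x 28
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6 * 28 * 28, 10),
)

# apply() calls the function on every submodule, then on the model itself
model.apply(init_by_layer_type)

linear = model[3]
xavier_bound = math.sqrt(6.0 / (linear.in_features + linear.out_features))
print("largest |weight| in the linear layer:", linear.weight.abs().max().item())
print("Xavier uniform bound:", xavier_bound)
```

Since `apply()` visits containers and activation modules as well, the `isinstance` checks are what keep the rule from touching modules that have no weights.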
This brings us to the end of this article on weight initialization. Stay tuned for more such articles on deep learning and PyTorch.