Knowing how to initialize model weights is an important topic in Deep Learning. The initial weights affect many aspects of training, including the gradients and the output subspace. In this article, we will learn about some of the most important and widely used weight initialization techniques and how to implement them using PyTorch. This article expects the reader to have beginner-level familiarity with PyTorch.

## Why is it important to initialize model weights?

The goal of training any deep learning model is finding the optimum set of weights for the model that gives us the desired results. The training methods used in Deep Learning are generally iterative in nature and require us to provide an initial set of weights that needs to be updated over time.

The initial weights play a huge role in deciding the final outcome of the training. A poor initialization can lead to vanishing or exploding gradients, which stall or destabilize training. So we use some standard methods of initializing the layers, which we will be discussing in this article.

## The general rule of thumb

A rule of thumb is that the *“initial model weights need to be close to zero, but not zero”*. A naive idea would be to sample from a Distribution that is arbitrarily close to 0.

For example, you can choose to fill the weight with values sampled from U(-0.01, 0.01) or N(0, 0.01).

It turns out that this idea is not so naive after all: most of the standard methods are based on sampling from a uniform or normal distribution.

But the real trick lies in setting the boundary conditions for these distributions. One of the generally used boundary conditions is 1/sqrt(n), where n is the number of inputs to the layer.

In PyTorch, we can set the weights of a layer to be sampled from a uniform or normal distribution using the `uniform_()` and `normal_()` functions. Here is a simple example of `uniform_()` and `normal_()` in action.

```
from math import sqrt

import torch.nn as nn

# Linear Dense Layer
layer_1 = nn.Linear(5, 2)
print("Initial Weight of layer 1:")
print(layer_1.weight)
# Initialization with uniform distribution, bounds are +/- 1/sqrt(n) with n = 5 inputs
nn.init.uniform_(layer_1.weight, -1/sqrt(5), 1/sqrt(5))
print("\nWeight after sampling from Uniform Distribution:\n")
print(layer_1.weight)
# Initialization with normal distribution (mean 0, standard deviation 1/sqrt(5))
nn.init.normal_(layer_1.weight, 0, 1/sqrt(5))
print("\nWeight after sampling from Normal Distribution:\n")
print(layer_1.weight)
```

**Output:**

```
Initial Weight of layer 1:
Parameter containing:
tensor([[-0.0871, -0.0804, 0.2327, -0.1453, -0.1019],
[-0.1338, -0.2465, 0.3257, -0.2669, -0.1537]], requires_grad=True)
Weight after sampling from Uniform Distribution:
Parameter containing:
tensor([[ 0.4370, -0.4110, 0.2631, -0.3564, 0.0707],
[-0.0009, 0.3716, -0.3596, 0.3667, 0.2465]], requires_grad=True)
Weight after sampling from Normal Distribution:
Parameter containing:
tensor([[-0.2148, 0.1156, 0.7121, 0.2840, -0.4302],
[-0.2647, 0.2148, -0.0852, -0.3813, 0.6983]], requires_grad=True)
```

But there are also some limitations to these methods. They are a bit too generalized and tend to be problematic for layers with non-linear activation functions such as `Sigmoid`, `Tanh` and `ReLU`, where there is a high chance of vanishing or exploding gradients.
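To make this limitation concrete, here is a minimal sketch (not part of the original walkthrough) that pushes random data through a stack of `Linear` + `Tanh` layers and compares the spread of the final activations under two initializations. The helper name `final_activation_std`, the depth of 20 and the width of 256 are illustrative assumptions, and the second run uses the Xavier method covered in the next section.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable


def final_activation_std(init_fn, depth=20, width=256):
    """Push random data through `depth` Linear+Tanh layers and
    return the standard deviation of the last layer's output."""
    x = torch.randn(1024, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width, bias=False)
            init_fn(layer.weight)  # in-place initialization of the weights
            x = torch.tanh(layer(x))
    return x.std().item()


# Weights that are "close to zero" with a hand-picked spread
tiny_std = final_activation_std(lambda w: nn.init.normal_(w, 0.0, 0.01))
# Weights scaled from the layer dimensions (Xavier)
xavier_std = final_activation_std(nn.init.xavier_normal_)

print(f"N(0, 0.01) init: {tiny_std:.2e}")
print(f"Xavier init:     {xavier_std:.2e}")
```

With the hand-picked weights the signal should shrink by many orders of magnitude by the last layer, while the dimension-aware scaling keeps it at a usable size.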

So in the next section we explore some of the advanced methods that have been proposed to tackle this problem.

## Initialization of layers with non-linear activation

There are two standard methods for weight initialization of layers with non-linear activation: the Xavier (Glorot) initialization and the Kaiming initialization.

We will not dive into the mathematical expression and proofs but focus more on where to use them and how to apply them. This is absolutely not an invitation to skip the mathematical background.

### 1. Xavier Initialization

Xavier initialization is used for layers having `Sigmoid` and `Tanh` activation functions. There are two different versions of Xavier initialization. The difference lies in the distribution from which the values are sampled: the uniform distribution or the normal distribution. Here is a brief overview of the two variations:

### 2. Xavier Uniform Distribution

In this method, the weight tensor is filled with values sampled from the uniform distribution U(-a, a), where

`a = gain * sqrt(6 / (input_dim + output_dim))`

`input_dim` and `output_dim` are the input and output dimensions of the layer, or more explicitly the sizes of the previous and the following layer, and `gain` is simply a scaling factor.

**Example:**

```
import torch.nn as nn

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2,2))
# Initializing with Xavier Uniform
nn.init.xavier_uniform_(conv_layer.weight)
```
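As a sanity check, the bound can be recomputed by hand. The sketch below assumes the bound documented for `xavier_uniform_`, `gain * sqrt(6 / (fan_in + fan_out))`, and the usual convention that a convolution's input and output dimensions are its channel counts multiplied by the number of kernel elements. The variable names `kernel_elems`, `input_dim` and `output_dim` are chosen here for illustration.

```python
import math

import torch.nn as nn

conv_layer = nn.Conv2d(1, 4, (2, 2))
nn.init.xavier_uniform_(conv_layer.weight)

# For a convolution, each dimension is the channel count times
# the number of elements in the kernel (2 * 2 = 4 here).
kernel_elems = conv_layer.kernel_size[0] * conv_layer.kernel_size[1]
input_dim = conv_layer.in_channels * kernel_elems    # 1 * 4 = 4
output_dim = conv_layer.out_channels * kernel_elems  # 4 * 4 = 16

gain = 1.0  # default gain of xavier_uniform_
a = gain * math.sqrt(6.0 / (input_dim + output_dim))
max_abs_weight = conv_layer.weight.abs().max().item()

print(f"bound a          = {a:.4f}")  # sqrt(6 / 20), about 0.5477
print(f"largest |weight| = {max_abs_weight:.4f}")
```

Every sampled weight should fall inside the interval (-a, a), so the second number never exceeds the first.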

### 3. Xavier Normal Distribution

This method is similar to the previous one, except that the values are sampled from the normal distribution N(0, std^2), where

`std = gain * sqrt(2 / (input_dim + output_dim))`

and `input_dim` and `output_dim` are the input and output dimensions, or more explicitly the sizes of the previous and the following layer.

**Example:**

```
import torch.nn as nn

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2,2))
# Initializing with Xavier Normal
nn.init.xavier_normal_(conv_layer.weight)
```
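The standard deviation can be checked empirically in the same spirit. This sketch assumes the documented formula for `xavier_normal_`, `std = gain * sqrt(2 / (fan_in + fan_out))`, and uses a larger `Linear` layer (an arbitrary choice of 512 inputs and 256 outputs) so that the sample estimate is reasonably tight.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable

layer = nn.Linear(512, 256)  # weight shape: (256, 512)
nn.init.xavier_normal_(layer.weight)

input_dim, output_dim = layer.in_features, layer.out_features
expected_std = math.sqrt(2.0 / (input_dim + output_dim))  # gain = 1
measured_std = layer.weight.std().item()

print(f"expected std: {expected_std:.4f}")
print(f"measured std: {measured_std:.4f}")
```

The two values should agree closely, since the estimate is taken over more than a hundred thousand sampled weights.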

## Kaiming Initialization

So far we have discussed how to initialize weights when the layer has a `Sigmoid` or `Tanh` activation function. We have not yet discussed `ReLU`.

Layers with the `ReLU` activation function were once initialized using the Xavier method, until Kaiming He and his co-authors proposed a method designed for layers with `ReLU` activations. Kaiming initialization differs from Xavier initialization only in the mathematical formula for the boundary conditions.

The PyTorch implementation of Kaiming initialization handles not only ReLU but also LeakyReLU. PyTorch offers two different modes for Kaiming initialization: the fan_in mode and the fan_out mode. Using the fan_in mode preserves the magnitude of the variance in the forward pass, so the data neither explodes nor vanishes as it flows through the network. Similarly, the fan_out mode preserves the magnitude of the gradients in back-propagation.

### 1. Kaiming Uniform Distribution

The weight tensor is filled with values sampled from the uniform distribution U(-a, a), where

`a = gain * sqrt(3 / fan_mode)`

For fan_in mode, `fan_mode` is the input dimension, whereas for fan_out mode it is the output dimension. The gain for ReLU is √2 and for LeakyReLU it is √(2 / (1 + negative_slope^2)), where negative_slope is the slope of LeakyReLU for negative inputs.

The gain is usually taken care of by the `kaiming_uniform_()` and `kaiming_normal_()` functions, where we need to specify only the type of non-linearity we are dealing with.
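If you want to inspect the gain yourself, `torch.nn.init` exposes a `calculate_gain()` helper. The short sketch below compares its result with the closed-form values for ReLU and LeakyReLU; the negative slope of 0.2 is an arbitrary value picked for illustration.

```python
import math

import torch.nn as nn

negative_slope = 0.2  # illustrative LeakyReLU slope

relu_gain = nn.init.calculate_gain('relu')
leaky_gain = nn.init.calculate_gain('leaky_relu', negative_slope)

# Closed-form values for comparison
expected_relu = math.sqrt(2.0)
expected_leaky = math.sqrt(2.0 / (1 + negative_slope ** 2))

print(f"ReLU gain:      {relu_gain:.4f} (expected {expected_relu:.4f})")
print(f"LeakyReLU gain: {leaky_gain:.4f} (expected {expected_leaky:.4f})")
```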

**Example:**

```
import torch.nn as nn

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2,2))
# Initializing with Kaiming Uniform
nn.init.kaiming_uniform_(conv_layer.weight, mode='fan_in', nonlinearity='relu')
```

### 2. Kaiming Normal Distribution

The layer weights are sampled from the normal distribution N(0, std^2), where

`std = gain / sqrt(fan_mode)`

Here `fan_mode` stands for the input dimension in fan_in mode and the output dimension in fan_out mode, so it is selected by the choice of operating mode.

**Example:**

```
import torch.nn as nn

# The convolution layer
conv_layer = nn.Conv2d(1, 4, (2,2))
# Initializing with Kaiming Normal
nn.init.kaiming_normal_(conv_layer.weight, mode='fan_in', nonlinearity='relu')
```
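To see what the operating mode changes in practice, the following sketch initializes the same layer shape in both modes and compares the measured standard deviation with the documented `gain / sqrt(fan_mode)`, using the ReLU gain of √2. The helper `kaiming_std` and the layer size (400 inputs, 120 outputs) are assumptions made for this illustration.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the run is repeatable


def kaiming_std(mode):
    """Initialize a fresh 400 -> 120 linear layer and return the weight std."""
    layer = nn.Linear(400, 120)
    nn.init.kaiming_normal_(layer.weight, mode=mode, nonlinearity='relu')
    return layer.weight.std().item()


std_fan_in = kaiming_std('fan_in')
std_fan_out = kaiming_std('fan_out')

# gain / sqrt(fan_mode) with gain = sqrt(2) for ReLU
expected_fan_in = math.sqrt(2.0 / 400)
expected_fan_out = math.sqrt(2.0 / 120)

print(f"fan_in : measured {std_fan_in:.4f}, expected {expected_fan_in:.4f}")
print(f"fan_out: measured {std_fan_out:.4f}, expected {expected_fan_out:.4f}")
```

Because this layer has more inputs than outputs, fan_in mode yields the smaller weights; for a layer that widens, the relationship flips.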

## Integrating the initializing rules in your PyTorch Model

Now that we are familiar with how we can initialize single layers using PyTorch, we can try to initialize layers of real-life PyTorch models. We can do this initialization in the model definition or apply these methods after the model has been defined.

### 1. Initializing when the model is defined

```
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Layer definitions
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        # Initialization: Kaiming for the ReLU layers, Xavier for the sigmoid output
        nn.init.kaiming_normal_(self.fc1.weight, mode='fan_in',
                                nonlinearity='relu')
        nn.init.kaiming_normal_(self.fc2.weight, mode='fan_in',
                                nonlinearity='relu')
        nn.init.xavier_normal_(self.fc3.weight)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        # torch.sigmoid is the functional form; nn.sigmoid does not exist
        x = torch.sigmoid(x)
        return x


# Every time you create a new model, its weights are initialized as above
net = Net()
```

### 2. Initializing after the model is created

You can always alter the weights after the model is created. You can do this by defining a rule for a particular type of layer and applying it to the whole model, or just by initializing a single layer.

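```
import torch.nn as nn

# Defining a method for initialization of linear weights
# The initialization will be applied to all linear layers
# irrespective of their activation function
def init_weights(m):
    if isinstance(m, nn.Linear):
        # The trailing underscore marks the in-place version;
        # the older xavier_uniform name is deprecated
        nn.init.xavier_uniform_(m.weight)

# Applying it to our net (the Net instance created earlier)
net.apply(init_weights)
```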

```
# Create the model
net = Net()
# Apply the Xavier normal method to the last layer
# (outside the class, the layer is reached through the instance, not self)
nn.init.xavier_normal_(net.fc3.weight)
```
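The two approaches can also be combined into a single rule that picks the method per layer type. The sketch below is one possible arrangement, not a prescribed recipe: the choice of Kaiming for convolutions, Xavier for linear layers and zeroed biases is an assumption made for this example, as is the small stand-in `nn.Sequential` model used to keep the snippet self-contained.

```python
import torch
import torch.nn as nn


def init_weights(m):
    """Illustrative rule: Kaiming for conv layers, Xavier for linear layers."""
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
    elif isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
    else:
        return  # leave every other module type untouched
    if m.bias is not None:
        nn.init.zeros_(m.bias)  # start the biases at zero


# A small stand-in model for 3x32x32 inputs
model = nn.Sequential(
    nn.Conv2d(3, 6, 5),          # output: 6 x 28 x 28
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(6 * 28 * 28, 10),
)

# apply() calls init_weights on every submodule, then on the model itself
model.apply(init_weights)

out = model(torch.randn(2, 3, 32, 32))
print(out.shape)
```

Because `apply()` visits every submodule recursively, the same function works unchanged on deeper or nested models.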

## Conclusion

This brings us to the end of this article on weight initialization. Stay tuned for more such articles on deep learning and PyTorch.