Gradient descent algorithm with implementation from scratch


In this article, we will learn about one of the most important algorithms used across all kinds of machine learning and neural network models, and we will implement the gradient descent algorithm from scratch in Python with an example. So without further ado, let us start.

What is Gradient Descent?

Gradient descent is an optimization algorithm used to minimize a cost function in machine learning and deep learning models. It is a first-order optimization algorithm that iteratively adjusts the parameters of a model to reduce the cost. The algorithm works by computing the gradient of the cost function with respect to the parameters and updating the parameters in the direction of the steepest decrease in the cost. Before going into the mathematical part, let us first understand what a cost function is for any algorithm.
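
To make this concrete before introducing a cost function, here is a minimal one-dimensional sketch: we minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3). The function, starting point, and learning rate here are arbitrary choices for illustration only.

def f_grad(x):
    # Gradient of f(x) = (x - 3)^2, which has its minimum at x = 3
    return 2 * (x - 3)

x = 0.0        # initial guess
alpha = 0.1    # learning rate
for i in range(50):
    x = x - alpha * f_grad(x)  # step in the direction of steepest decrease

print(x)  # approaches 3.0

Each step moves x opposite to the gradient, so it slides downhill toward the minimum at x = 3.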

Cost Function:

In machine learning, a cost function (also known as a loss function or error function) is a mathematical function that measures the difference between the predicted values and the actual values for a given set of input/output pairs. The goal of the machine learning algorithm is to find the parameters that minimize the cost function. For example, in linear regression, the mean squared error (MSE) is often used as the cost function:

MSE = (1/m) * sum((y_pred,i - y_i)^2)

where m is the number of samples, y_i is the actual value, and y_pred,i is the predicted value for the i-th sample.

Another very popular metric is RMSE, which is simply the square root of MSE. It is also used to evaluate model performance: the lower the RMSE, the more accurate the predictions made by our model.
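
As a quick illustration, here is how MSE and RMSE can be computed with NumPy; the values below are made up purely to show the calculation:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # actual values (made-up)
y_pred = np.array([2.5, 5.5, 7.0, 8.0])  # model predictions (made-up)

mse = np.mean((y_pred - y_true) ** 2)  # mean squared error
rmse = np.sqrt(mse)                    # root mean squared error

print(mse, rmse)  # 0.375 and about 0.612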

Mathematical understanding of gradient descent

The basic idea behind gradient descent is to start with an initial set of parameters and then iteratively adjust them in the direction of the steepest decrease in the cost until we reach a minimum. The size of the adjustment is determined by the learning rate(α), which determines how much the parameters are updated in each iteration. Let us understand this through an example using MSE:

We know that given the input features x and output labels y, the linear regression model is defined as given below where w is a vector of parameters and b is a bias term:

y_pred = w · x + b

The mean squared error cost function is defined as given below, where m is the number of samples and y_pred,i is the predicted output for the i-th sample:

J(w, b) = (1/m) * sum((y_pred,i - y_i)^2)

The gradient of the cost with respect to the parameters w can be calculated as follows:

dJ/dw = (2/m) * sum((y_pred,i - y_i) * x_i)

And the gradient of the cost with respect to the bias term b can be calculated as follows:

dJ/db = (2/m) * sum(y_pred,i - y_i)

Finally, the parameters w and b can be updated using the gradient descent rule given below, where α is the learning rate, which determines the step size of the update:

w = w - α * dJ/dw
b = b - α * dJ/db
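
For concreteness, here is one worked iteration of these updates on two made-up samples, (x, y) = (1, 2) and (2, 4), starting from w = 0 and b = 0 with α = 0.1. Both predictions start at y_pred = 0, so:

dJ/dw = (2/2) * ((0 - 2) * 1 + (0 - 4) * 2) = -10
dJ/db = (2/2) * ((0 - 2) + (0 - 4)) = -6

w = 0 - 0.1 * (-10) = 1.0
b = 0 - 0.1 * (-6) = 0.6

Repeating such steps moves w and b toward the values that minimize the cost.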

Note: A smaller learning rate leads to slower convergence, while a larger learning rate may cause the algorithm to overshoot the minimum and fail to converge.

Types of Gradient Descent

  • Batch gradient descent computes the gradient of the cost with respect to the parameters for the entire training dataset in each iteration.
  • Stochastic gradient descent computes the gradient of the cost with respect to a single randomly selected training example in each iteration.
  • Mini-batch gradient descent computes the gradient of the cost with respect to a small randomly selected subset of the training examples in each iteration.

Mini-batch gradient descent is the most commonly used variant: a full-batch learner must scan the entire dataset for every single weight update, whereas a mini-batch learner performs many weight updates per pass over the dataset, which results in substantially faster training.
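
To make the distinction concrete, here is a minimal sketch of mini-batch gradient descent for the same linear model; the batch size, shuffling scheme, and function name are illustrative choices, not fixed conventions:

import numpy as np

def minibatch_gradient_descent(x, y, alpha, num_iters, batch_size=32):
    m = x.shape[0]
    w = np.zeros(x.shape[1])
    b = 0
    for i in range(num_iters):
        # Shuffle once per pass so each mini-batch is a random subset
        idx = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            xb, yb = x[batch], y[batch]
            y_pred = np.dot(xb, w) + b
            # Gradients computed over the mini-batch only
            dw = (2 / len(batch)) * np.dot(xb.T, (y_pred - yb))
            db = (2 / len(batch)) * np.sum(y_pred - yb)
            w = w - alpha * dw
            b = b - alpha * db
    return w, b

Setting batch_size = m recovers batch gradient descent, and batch_size = 1 recovers stochastic gradient descent.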

Implementing Gradient Descent from Scratch in Python

Let’s consider an example of linear regression with a single input feature to illustrate the gradient descent algorithm. The cost function for linear regression is the mean squared error (MSE), which is given by:

MSE(w, b) = (1/m) * sum((y_pred - y_true)^2)

where w and b are the parameters of the linear regression model, y_pred is the predicted output, y_true is the actual output, and m is the number of training examples. To implement gradient descent, we need to compute the gradient of the cost function with respect to the parameters w and b:

dw = (2/m) * sum((y_pred - y_true) * x)
db = (2/m) * sum(y_pred - y_true)

Next, we can update the parameters using the gradient and the learning rate alpha.

w = w - alpha * dw
b = b - alpha * db

Here is the complete implementation of the gradient descent algorithm for linear regression in Python:

import numpy as np

def gradient_descent(x, y, alpha, num_iters):
    m = x.shape[0]            # number of training examples
    w = np.zeros(x.shape[1])  # initialize the weights to zeros
    b = 0                     # initialize the bias to zero
    for i in range(num_iters):
        y_pred = np.dot(x, w) + b                # predictions of the linear model
        dw = (2/m) * np.dot(x.T, (y_pred - y))   # gradient of MSE w.r.t. w
        db = (2/m) * np.sum(y_pred - y)          # gradient of MSE w.r.t. b
        w = w - alpha * dw                       # update the weights
        b = b - alpha * db                       # update the bias
    return w, b
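
As a quick sanity check, we can run it on synthetic data; the true slope of 2, the intercept of 1, the noise level, and the hyperparameters below are arbitrary choices for demonstration:

np.random.seed(0)
x = np.random.rand(100, 1)                          # 100 samples, 1 feature
y = 2 * x[:, 0] + 1 + 0.01 * np.random.randn(100)   # y is roughly 2x + 1 plus small noise

w, b = gradient_descent(x, y, alpha=0.1, num_iters=1000)
print(w, b)  # w should approach [2.] and b should approach 1.0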

Summary

We have understood the mathematics behind the gradient descent algorithm as well as its implementation from scratch in Python. Feel free to drop a comment for any queries.
