Logistic Regression in Predictive Analytics: A Comprehensive Guide


Logistic regression is a statistical model used for predictive analytics and classification tasks. In statistics, logistic regression predicts the probability of an event occurring, where the outcome is binary, that is, 0 or 1. The idea behind the model is captured by the S-shaped logistic curve, which shows how the model can take any numerical input and map it to one of the two outcomes, 0 or 1.

It is used when there is no linear relationship between the dependent and independent variables. Rather than predicting a numerical value, it classifies the output as either 0 or 1.

The logistic regression function, also called the sigmoid function, is used to map any numerical value to a value between 0 and 1. For large negative inputs, the sigmoid output approaches 0, indicating a low probability of the event. In contrast, for large positive inputs, the output approaches 1, indicating a higher probability of the event occurring.

Suggested: What Are the Different Types of Classification Algorithms?

Exploring the Odds Ratio

The odds of an event are defined as the probability of the event happening divided by the probability of it not happening, and the odds ratio compares these odds across values of a predictor. Interpreting the odds ratio in a logistic regression model helps us determine the strength of the association between the input and the output. The odds can be mathematically represented as:

Odds = P(Y=1) / (1 - P(Y=1))

where,

  • P(Y=1) is the probability of Y occurring
  • and 1 - P(Y=1) is the probability of Y not occurring.

In logistic regression, the odds ratio is very important, and we will use it in later sections to interpret the model. It is used to assess the direction and magnitude of the association between each input and the output.

The natural logarithm of the odds is called the logit function.
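As a quick illustration, here is a minimal sketch (with a made-up probability of 0.8) of how the odds and the logit can be computed in Python:

from math import log

p = 0.8  # hypothetical probability of the event (Y = 1)

odds = p / (1 - p)  # odds of the event happening vs. not happening
logit = log(odds)   # the logit: natural logarithm of the odds

print(f"Odds: {odds:.2f}")    # Odds: 4.00
print(f"Logit: {logit:.2f}")  # Logit: 1.39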

Understanding the Sigmoid Function in Logistic Regression

The equation of the sigmoid function is as follows:

f(z) = 1 / (1 + e^(-z))

Let’s break down the components of the equation one by one:

  • e: the base of the natural logarithm, whose value is approximately 2.718.
  • z = b0 + b1x1, where b0 is the intercept and b1 is the slope.
  • If we take the logit of the above function, that is, the natural logarithm of f(z)/(1 - f(z)), we get back z = b0 + b1x1.
  • The denominator (1 + e^(-z)) is always greater than 1, so its reciprocal always lies between 0 and 1.
  • If the value of the function is greater than or equal to 0.5, the instance is classified as 1.

The sigmoid function looks like this:

[Figure: The Sigmoid Function]
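To make this concrete, here is a minimal sketch of the sigmoid function and the 0.5 decision threshold in Python (the helper name predict_class is our own, not part of any library):

from math import exp

def sigmoid(z):
    # Maps any real number to a value between 0 and 1
    return 1 / (1 + exp(-z))

def predict_class(z, threshold=0.5):
    # Classify as 1 if the sigmoid output is >= threshold, else 0
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(-4))       # ~0.018, close to 0 for large negative inputs
print(sigmoid(4))        # ~0.982, close to 1 for large positive inputs
print(predict_class(0))  # sigmoid(0) = 0.5, so classified as 1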

Dataset Overview

We will be using the fake bills dataset from the UCI repository. The dataset contains 1000 instances of genuine dollar bills and 500 instances of fake dollar bills. We have to classify each bill as “genuine” or “not”.

[Figure: Dataset Fake Bills]

We have six parameters based on which we will reach this conclusion. They are:
  • Margin_low: the measurement of the lower margin of the bill (float).
  • Margin_up: the measurement of the upper margin of the bill (float).
  • Diagonal: the diagonal measurement of the bill (float).
  • Length: the total length of the bill (float).
  • Height_left: the height of the left side of the bill (float).
  • Height_right: the height of the right side of the bill (float).

All of these are the independent variables in our dataset; we have to evaluate their effect on the variable called “is_genuine”. The variable “is_genuine” is our dependent, target variable, which is boolean in nature, that is, it can take only two values, “yes” and “no”. If the bill is classified as real, this variable will have the value yes; otherwise, no.

You can find the dataset here.

Python Implementation Steps

In this section, we will delve right into implementing the logistic regression framework using Python. First, we will need to import the required modules: the pandas library, the statsmodels and sklearn libraries, and the math module.

You can import these modules in the following manner:

# Importing the libraries needed for running the Logistic Regression Model
print("Logistic Regression classification=")
import pandas as pd
from math import sqrt
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

A lot of modules and functions have been imported in the above section, so let's go through them one by one:

  • pandas is used for loading the dataset and manipulating it as a DataFrame.
  • sqrt from the math module is the square root function (imported here, though not used later in this walkthrough).
  • statsmodels.api is used for statistical modeling and analysis, and gives us a detailed model summary.
  • LogisticRegression from the sklearn library is the model we will train.
  • train_test_split is used to split a dataset into training and testing sets.
  • accuracy_score tells us how accurate our model's predictions are.

The train-test split is used to evaluate the model. The dataset is separated into a training set and a testing set: the training set is used to train the model, and the testing set is used to evaluate how well the model performs. We will use the hold-out method for the train-test split. In this method, we specify a percentage of the dataset for training and the rest is used for testing. For example, we will be using 20% of the dataset for testing, and the remaining 80% for training.

In the next section, we will import our dataset into the program.

# Load the dataset (semicolon-delimited CSV)
data = pd.read_csv("/content/fake_bills (1).csv", delimiter=";")

# Drop rows with missing values
data.dropna(inplace=True)

# Split the data into features (X) and the target variable (y)
x = data[['diagonal', 'height_left', 'height_right', 'margin_low', 'margin_up', 'length']]
y = data['is_genuine']
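
Before splitting, it is worth sanity checking what was loaded; a quick sketch:

# Quick sanity checks on the cleaned dataset
print(data.shape)        # rows and columns remaining after dropna
print(data.head())       # first few rows of the dataset
print(y.value_counts())  # class balance of the target variable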

Now we will split the dataset into training and test sets.

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
#test size is 20% of the total dataset

Now we will train our model and print the summary of our results.

# Train the scikit-learn logistic regression model on the training set
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(x_train, y_train)

# Fit a statsmodels Logit model on the full dataset to get a detailed summary
# (note: no intercept term is added here; sm.add_constant(x) would add one)
log_reg = sm.Logit(y, x).fit()
print(log_reg.summary())

The summary would be:

Logistic Regression classification=
Optimization terminated successfully.
         Current function value: 0.027098
         Iterations 12
                           Logit Regression Results                           
============================================================================
Dep. Variable:             is_genuine   No. Observations:                 1463
Model:                          Logit   Df Residuals:                     1457
Method:                           MLE   Df Model:                            5
Date:                Tue, 23 Jan 2024   Pseudo R-squ.:                  0.9576
Time:                        09:49:24   Log-Likelihood:                -39.644
converged:                       True   LL-Null:                       -934.20
Covariance Type:            nonrobust   LLR p-value:                     0.000
============================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
diagonal        -0.4755      0.727     -0.654      0.513      -1.901       0.950
height_left     -1.5227      1.053     -1.446      0.148      -3.587       0.541
height_right    -3.4686      1.145     -3.030      0.002      -5.712      -1.225
margin_low      -6.0609      0.993     -6.103      0.000      -8.007      -4.115
margin_up      -10.4068      2.183     -4.768      0.000     -14.685      -6.129
length           5.8826      0.874      6.734      0.000       4.170       7.595
================================================================================

Possibly complete quasi-separation: A fraction 0.53 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.

In the next section, let’s interpret the results.

Recommended Reading: Applied Predictive Modeling in Python.

Analyzing Model Summary

  • The “Optimization terminated successfully” message indicates that the model was able to reach a stable solution.
  • The low current function value is the value of the objective function (the negative average log-likelihood) at the final iteration of the optimization. A lower value indicates a better fit to the data.
  • The number of iterations indicates how many times the algorithm iterated before converging to a solution. These quantities can also be read off the fitted model directly, as shown below.
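
If you prefer to read these quantities programmatically instead of from the printed summary, the fitted statsmodels result exposes them as attributes (a small sketch using the log_reg object fitted above):

# Fit statistics from the fitted statsmodels result
print(f"Log-likelihood:      {log_reg.llf:.3f}")       # higher (closer to 0) means a better fit
print(f"Null log-likelihood: {log_reg.llnull:.3f}")    # log-likelihood of the null model
print(f"Pseudo R-squared:    {log_reg.prsquared:.4f}") # McFadden's pseudo R-squared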

Now we will interpret the logit regression results:

  • Each coefficient represents the change in the log odds of the dependent variable (is_genuine) for a one-unit change in the corresponding predictor variable.
  • For example, the coefficient for “diagonal” is -0.4755, suggesting that a one-unit increase in “diagonal” is associated with a decrease of approximately 0.4755 in the log odds of the bill being genuine.
  • The p-values: variables with a p-value greater than 0.05 are considered statistically insignificant; from the above results, the two statistically insignificant variables are diagonal and height_left. The coefficients can also be converted into odds ratios, as shown in the sketch after this list.
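
Because each coefficient is a change in log odds, exponentiating it gives the corresponding odds ratio, that is, the multiplicative change in the odds for a one-unit increase in that predictor. A short sketch using the fitted statsmodels model:

import numpy as np

# Convert log-odds coefficients into odds ratios
odds_ratios = np.exp(log_reg.params)
print(odds_ratios)
# For example, exp(-0.4755) is roughly 0.62: a one-unit increase in "diagonal"
# multiplies the odds of the bill being genuine by about 0.62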

The message at the very end warns about quasi-separation, suggesting that the model might be facing some limitations because some combinations of the predictor variables perfectly predict the outcome variable. This can lead to problems with parameter identification: the maximum likelihood estimates of some coefficients tend toward infinity, making them unreliable.
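
A common way to work around separation is regularization, which keeps the coefficient estimates finite even when the classes are almost perfectly separable. scikit-learn's LogisticRegression applies L2 regularization by default, which is why the logreg_model trained earlier is not affected; statsmodels also offers a penalized fit (a sketch using fit_regularized, with an arbitrarily chosen penalty strength):

# L1-penalized fit to stabilize the coefficients under quasi-separation
log_reg_l1 = sm.Logit(y, x).fit_regularized(alpha=1.0)  # alpha=1.0 is an arbitrary choice here
print(log_reg_l1.params)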

Further Scope

In this article, we have seen how easily we can implement the logistic regression model in Python and perform classification tasks. You can also use the accuracy_score function to measure the accuracy of the model in the following way:

# Predictions on the test set
y_pred = logreg_model.predict(x_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

The output would be:

Accuracy: 1.0000

In this case, we have an accuracy score of 1.00, which is 100% accuracy. You can also use the confusion matrix, the F1 score, or precision and recall to gauge how good your model is. All of these are evaluation metrics, and you can find their documentation here!
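
For instance, these metrics can be computed from the same test-set predictions (a sketch using sklearn.metrics):

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Additional evaluation metrics on the test set
print(confusion_matrix(y_test, y_pred))  # rows are actual classes, columns are predicted classes
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1 score:  {f1_score(y_test, y_pred):.4f}")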

Further, you can fine-tune your model by removing the insignificant parameters from the predictor variables and seeing how the model metrics change compared to the old ones. Note that this dataset gave a perfect score of 100% accuracy with zero false positives and zero false negatives. This is very rare, and there may be factors in the dataset that make the model appear this robust; most of the time, your model's accuracy will vary as you change parameters and apply hyperparameter tuning.
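
As a starting point, here is a sketch that drops the two insignificant predictors found above (diagonal and height_left) and re-evaluates the model:

# Refit using only the statistically significant predictors
x_reduced = data[['height_right', 'margin_low', 'margin_up', 'length']]
xr_train, xr_test, yr_train, yr_test = train_test_split(
    x_reduced, y, test_size=0.2, random_state=42
)

logreg_reduced = LogisticRegression(random_state=42)
logreg_reduced.fit(xr_train, yr_train)

print(f"Reduced-model accuracy: {accuracy_score(yr_test, logreg_reduced.predict(xr_test)):.4f}")

To know more about logistic regression, click here!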