ROC curves in Machine Learning

ROC Curves

The ROC curve stands for Receiver Operating Characteristic curve. ROC curves display the performance of a classification model.

ROC tells us how good the model is for distinguishing between the given classes, in terms of the predicted probability.

In this article, we will understand ROC curves, what is AUC, and implement a binary classification problem to understand how to plot the ROC curve for a model.

Introduction to ROC Curves

Suppose we have a Logistic regression model that classifies an event as True or False. We know that the default threshold value for classifying a point as True or False is 0.5 in Logistic regression but we can alter this threshold value to match according to our need.

So, the ROC curve is a plot of the false positive rate (FPR) (x-axis) vs. the true positive rate(TPR) (y-axis) for a number of different candidate threshold values between 0.0 and 1.0.

Understanding TPR and FPR

As mentioned, a ROC curve is dependent on True Positive Rate and False Positive Rate let’s see what they are.

True Positive Rate: The true positive rate is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives.

True Positive Rate = True Positives / (True Positives + False Negatives)

False Positive Rate: The false-positive rate is calculated as the number of false positives divided by the sum of the number of false positives and the number of true negatives.

False Positive Rate = False Positives / (False Positives + True Negatives)

For different threshold values we will get different TPR and FPR.

ROC Curve
ROC Curve

Why do we use ROC Curves?

ROC Curves are useful for the following reasons:

  • The area under the curve (AUC) can be used as an indicator of the performance of the model.
  • Different models can be compared against each other based on their ROC curves.

To get the best model we want to increase our True Positive Rate and Reduce our False Positive Rate (TPR = 1, FPR = 0).

This means that our model will be able to separate the classes correctly. Such models are known as skillful models. In real life, this is never achieved.

A model with no skill at each threshold is represented by a diagonal line from the bottom left of the plot to the top right (Blue line in the above figure). Such models have AUC 0.5. Such models have equal TPR and FPR for every value of the threshold.

Plotting ROC Curves in Python

Let’s now build a binary classifier and plot it’s ROC curve to better understand the process.

We will use a Logistic Regression model for this example. We’re working with three important libraries here – Matplotlib, Numpy, and sklearn.

#Importing Required Modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from  sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

##Creating a Dataset
X, label = make_classification(n_samples=500, n_classes=2, weights=[1,1], random_state=100)

#Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, label, test_size=0.3, random_state=1)

#Creating the class object and 
model = LogisticRegression()
model.fit(X_train, y_train)

#predict probabilities
probs = model.predict_proba(testX)

#Keeping only positive class
probs = probs[:, 1]

#Calculating the FPR and TPR
fpr, tpr, thresholds = roc_curve(testy, probs)

#Plotting the figure
plt.figure(figsize = (10,6))
plt.plot(fpr, tpr, color='red', label='ROC')
plt.plot([0, 1], [0, 1], color='darkblue', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic Curve')
plt.legend()
plt.show()
ROC Curve Of Logistic Regression Model
ROC Curve Of Logistic Regression Model

The sklearn module provides us with roc_curve function that returns False Positive Rates and True Positive Rates as the output.

This function takes in actual probabilities of both the classes and a the predicted positive probability array calculated using .predict_proba( ) method of LogisticRegression class.

There you go, now we know how to plot ROC curve for a binary classification model.

Conclusion

In this article we gained some information about ROC curves and why it is important. We also got some idea about True Positive Rates and False Positive Rates and how ROC curves are dependent on them. Finally we looked into the code to plot ROC curves for a Logistic Regression model.

Happy Learning!