ROC-AUC Score: A Key Metric for ML Model Performance


The ROC-AUC score tells us how well a machine learning model can separate things into different groups. ROC-AUC stands for “Receiver Operating Characteristic – Area Under Curve”.

The score ranges from 0 to 1. A score of 0.5 means the model is guessing randomly. A score of 1 means it perfectly separates the groups every time. Higher scores mean the model is better at telling the groups apart.
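To make those two extremes concrete, here is a quick sketch that checks both cases with scikit-learn; the labels and scores are made up purely for illustration and are not part of this article's example:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0, 1, 0, 1, 0])

# Every positive is scored higher than every negative: perfect separation, AUC = 1.0
perfect_scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
print(roc_auc_score(y_true, perfect_scores))  # 1.0

# Every instance gets the same score: no separation at all, AUC = 0.5
uninformative_scores = np.full(6, 0.5)
print(roc_auc_score(y_true, uninformative_scores))  # 0.5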

We use this score to test models that have to make choices between 2 options, like:

  • Is this picture a dog or cat?
  • Will this person like this movie?
  • Is this email spam or not spam?

Using the ROC-AUC score, it becomes clear whether our model is working well or needs improvement. The score lets us compare different models to pick the best one, and the underlying ROC curve shows the trade-off between the types of errors the model makes so we can tune it. Using ROC-AUC helps us build models that make good predictions.

Let’s understand what the score is and how we can calculate it.

Suggested: Logistic Regression in Predictive Analytics: A Comprehensive Guide.

Diving Deeper into ROC-AUC

The ROC curve (receiver operating characteristic curve) is a graph that plots the true positive (TP) rate against the false positive (FP) rate at various classification thresholds, and the ROC-AUC score is the area under that curve. It tells us how well our model can distinguish between the positive and negative classes.

A threshold is used to convert predicted probabilities into classification labels. For example, if the probability of an email being spam is 0.6 and our threshold is 0.5, the email will be classified as spam, or true (1); if the probability were 0.4, it would have been classified as not spam, or false (0).

In a binary problem with two classes, the threshold is the cutoff value: an instance whose predicted probability is above the threshold is assigned to one class, and an instance whose probability falls below it is assigned to the other.
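As a minimal sketch (with made-up probabilities and the commonly used 0.5 threshold), converting probabilities into labels is a one-liner with NumPy:

import numpy as np

probabilities = np.array([0.6, 0.4, 0.9, 0.2])  # made-up spam probabilities
threshold = 0.5

# Probabilities at or above the threshold become 1 (spam), the rest become 0 (not spam)
predicted_labels = (probabilities >= threshold).astype(int)
print(predicted_labels)  # [1 0 1 0]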

The ROC-AUC score can take any value from 0 to 1. Since the ROC curve plots the true positive rate against the false positive rate at various threshold values, we will simplify our understanding with the help of a confusion matrix.

A confusion matrix is used to assess the performance of a model on a binary classification problem. It is a 2×2 matrix showing the number of true positives, false positives, false negatives, and true negatives.

Confusion matrix
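As a small sketch, scikit-learn's confusion_matrix function builds this matrix directly; the labels and predictions below are made up purely to show the layout:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0])  # actual labels (made up)
y_pred = np.array([1, 0, 0, 1, 0, 1])  # predicted labels (made up)

# scikit-learn arranges the 2x2 matrix as:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))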

The formulae for calculating the true positive rate and the false positive rate are given below:

  • True positive rate (TPR) = TP / (TP + FN)
  • False positive rate (FPR) = FP / (FP + TN)
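Both rates can be computed straight from the confusion matrix entries; here is a short sketch (reusing the made-up labels from the previous snippet) showing the arithmetic:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0])  # actual labels (made up)
y_pred = np.array([1, 0, 0, 1, 0, 1])  # predicted labels (made up)

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)  # true positive rate
fpr = fp / (fp + tn)  # false positive rate
print("TPR:", tpr)
print("FPR:", fpr)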

Implementing and Interpreting ROC-AUC with Python

We will take a simple example to understand how the ROC-AUC metric reflects the performance of our model. I am using arbitrary data here, but you can run the same code on your own projects with your own data. The size of the dataset doesn't matter; just replace my data with yours and the method will work the same way.

Let’s get started.

Let’s say we are testing how good our model is at separating spam emails from non-spam emails. Below, we have the actual labels and their respective predicted probabilities of being spam. We have 7 emails, of which 3 are spam, labeled 1, and the other 4 are labeled 0, indicating that they aren't spam.

Data set:

  • Actual labels: 1, 0, 1, 0, 1, 0, 0
  • Predicted probabilities of being spam: 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1

Before we plot our ROC curve, we have to calculate the true positive and false positive rates at different thresholds using the formulae given above. Here, an email is classified as spam whenever its predicted probability is greater than or equal to the threshold (we'll verify these points with scikit-learn right after the list):

  • At threshold = 0.8, only the email with probability 0.8 (actually spam) is classified as spam, so TP = 1 and FP = 0, while the other two spam emails are missed (FN = 2). This gives TPR = 1/3 and FPR = 0. The confusion matrix for this threshold would be:
Confusion matrix at threshold = 0.8: TN = 4, FP = 0, FN = 2, TP = 1
  • At threshold = 0.7, TP = 1 and FP = 1 (the non-spam email with probability 0.7 is now flagged), so TPR = 1/3 and FPR = 1/4.
  • At threshold = 0.6, TP = 2 and FP = 1, so TPR = 2/3 and FPR = 1/4.
  • At threshold = 0.4, TP = 2 and FP = 2, so TPR = 2/3 and FPR = 1/2.
  • At threshold = 0.3, TP = 3 and FP = 2, so TPR = 1 and FPR = 1/2.
  • At threshold = 0.2, TP = 3 and FP = 3, so TPR = 1 and FPR = 3/4.
  • At threshold = 0.1, all 7 emails are classified as spam (TP = 3, FP = 4), so TPR = 1 and FPR = 1.
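If you'd like to double-check these hand calculations, scikit-learn's roc_curve returns the same (threshold, FPR, TPR) triples. This is only an optional sanity check; it uses the same example data as the full plotting script below, and passes drop_intermediate=False so that no thresholds are dropped from the output:

import numpy as np
from sklearn.metrics import roc_curve

actual_labels = np.array([1, 0, 1, 0, 1, 0, 0])
predicted_probs = np.array([0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])

# drop_intermediate=False keeps every threshold so the output matches the manual table;
# the first threshold is a sentinel above all scores and corresponds to the (0, 0) point
fpr, tpr, thresholds = roc_curve(actual_labels, predicted_probs, drop_intermediate=False)
for t, f, r in zip(thresholds, fpr, tpr):
    print(f"threshold={t}  FPR={f:.2f}  TPR={r:.2f}")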

Now, we could plot these points on a curve manually, but we won't, because we are programmers! We will get our computers to do it. Let's look at how we can do it using Python.

The code below will plot the ROC curve for you.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Example data
actual_labels = np.array([1, 0, 1, 0, 1, 0, 0])
predicted_probs = np.array([0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])

# Compute ROC curve and ROC area
fpr, tpr, thresholds = roc_curve(actual_labels, predicted_probs)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

The output will be something like this:

ROC curve for the example data

The ROC-AUC score is shown in the legend at the corner of the graph, but if you want to calculate it separately for your dataset, without drawing the graph, you can do it in the following way:

import numpy as np
from sklearn.metrics import roc_auc_score

# Example data
actual_labels = np.array([1, 0, 1, 0, 1, 0, 0])
predicted_probs = np.array([0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])

# Calculate ROC AUC score
roc_auc = roc_auc_score(actual_labels, predicted_probs)
print("ROC AUC score:", roc_auc)

The output will be:

ROC AUC score: 0.75
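In a real project, you would normally pass roc_auc_score the probabilities produced by a trained classifier rather than hand-written values. Here is a rough sketch of that workflow; the synthetic data from make_classification and the choice of LogisticRegression are arbitrary illustrations, not part of the example above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Fabricated binary classification data, purely for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Use the predicted probability of the positive class, not the hard 0/1 predictions
probs = model.predict_proba(X_test)[:, 1]
print("ROC AUC score:", roc_auc_score(y_test, probs))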

You might also like: Wine Classification using Python – Easily Explained.

Inference

A score of 0.75 says that our model has moderately good power to discriminate between the positive and negative classes. A score above 0.5 means the model separates the two classes better than random guessing; a score between 0.7 and 0.8 is generally considered good, though there is still room for improvement. A score of 1 says that the model ranks every positive instance above every negative one, separating the classes perfectly.

A score below 0.5 is low: the model performs worse than random guessing and needs more fine-tuning, data cleaning, or pre-processing. We've discussed how you can implement and interpret the ROC-AUC score of a particular model. How might you leverage this metric to refine your machine-learning projects further?