Detecting Anomalies with PyCaret: A Step-by-Step Guide

Machine learning applications are growing rapidly and are making work in business, education, and marketing easier. One such application is anomaly detection.

Anomaly detection is the process of identifying unusual patterns or behaviors in data that deviate significantly from the norm. It has wide-ranging applications, from detecting fraudulent transactions and network intrusions to identifying cancers and tumors in medical imaging. By pinpointing these anomalies, organizations can prevent potential threats, financial losses, and health risks. Let’s get right into the topic.

What is Anomaly Detection?

Anomaly detection identifies data points whose behavior differs greatly from the rest of the data distribution. It is used to detect fraudulent transactions, cancers or tumors in medical imaging, unusual behavior of proteins in human and animal bodies, and outliers in data analysis, among many other uses.
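To make the idea of "differing greatly from the distribution" concrete, here is a minimal sketch that flags values lying far from the mean. It uses a simple z-score rule, not one of PyCaret's models, and the transaction amounts and the threshold of 2.0 are made up for illustration.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=2.0):
    """Return the values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; 950 is the odd one out.
amounts = [20, 22, 19, 21, 23, 20, 18, 22, 21, 950]
outliers = zscore_outliers(amounts)
print(outliers)
```

A rule like this only works for a single numeric column with a roughly bell-shaped distribution. Models such as Isolation Forest handle many columns at once.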

PyCaret Functions for Anomaly Detection

PyCaret is a low-code library that automates much of the machine-learning process, from development to deployment. It provides functions for classification, regression, clustering, and anomaly detection. Let us look at the functions used to perform anomaly detection.

1. Setup

The setup function must be called before any other function in the anomaly detection module, or in any other PyCaret module. It initializes the environment and the preprocessing pipeline. Unlike setup for classification or regression, setup for anomaly detection does not need a target label, because anomaly detection is an unsupervised ML technique.

s = setup(data)

2. Models

The models() function is used to get a list of all the models available for anomaly detection.

3. Create Model

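The create_model function trains one of the models from that list on the data passed to setup and returns the trained model. It takes the model's ID as a string (for example, "iforest"). This is the primary step in model building.

model = create_model(model_id)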

4. Evaluate Model

The evaluate_model function is used to analyze a trained model. Its output differs between modules. In a notebook, it displays an interactive widget for browsing the plots available for the model. In the case of anomaly detection, this is the output.

Figure: Evaluate Model

5. Assign Model

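The assign_model function is specific to PyCaret's unsupervised modules, including anomaly detection. It takes the trained model (not its name) and returns the data passed to setup with the model's anomaly labels and scores attached.

result = assign_model(model)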

6. Plot Model

The plot_model function visualizes the results of a trained model. In the anomaly detection module, it projects the data into a low-dimensional space (t-SNE by default, or UMAP) and marks the records flagged as anomalies.

7. Save Model

The save_model function saves the trained model, together with its preprocessing pipeline, as a pickle file. Saving the model to disk lets us load it and use it again in the future.
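The functions above can be chained into one small helper. The following is a sketch rather than a canonical recipe: run_anomaly_workflow is a hypothetical name, and it assumes PyCaret is installed and that data is a pandas DataFrame. The fraction argument of create_model is the share of records the model treats as anomalies (0.05 by default).

```python
def run_anomaly_workflow(data, model_id="iforest", fraction=0.05, save_as=None):
    """Set up the experiment, train one detector, label the data, optionally save."""
    # Imported lazily so this helper can be defined even where PyCaret is not installed.
    from pycaret.anomaly import setup, create_model, assign_model, save_model

    # No target column is passed: anomaly detection is unsupervised.
    # Note: some PyCaret 2.x versions ask you to confirm the inferred data types here.
    setup(data, session_id=123)

    # Train the detector identified by its model ID (see models() for the list).
    model = create_model(model_id, fraction=fraction)

    # Attach the anomaly label and score columns to the data given to setup.
    labeled = assign_model(model)

    # Persist the model and its pipeline as "<save_as>.pkl" if a name was given.
    if save_as is not None:
        save_model(model, save_as)

    return model, labeled
```

A call such as model, labeled = run_anomaly_workflow(df, save_as="my_detector") would then return both the trained model and the labeled data.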

Example: Anomaly Detection with PyCaret

Let us see an example of using the anomaly detection module of the PyCaret library on a popular dataset called mice. The mice dataset is often used to demonstrate data imputation techniques because it has many missing values. It contains the expression levels of 77 proteins measured in the cerebral cortex of mice.

We are going to use this dataset to determine whether a test subject has Down syndrome (an anomaly) or is normal.

Loading and Splitting the Data

In this section, we are going to see how to load the mice data and split it into train and test sets.

import pycaret 
from pycaret.datasets import get_data
dataset = get_data('mice')

The get_data function loads one of the sample datasets that PyCaret provides and returns it as a pandas DataFrame. Here the loaded data is stored in a variable called dataset.

Figure: Dataset

train = dataset.sample(frac=0.95, random_state=786)
train.head()

The sample method of the pandas DataFrame randomly selects records from the data. Here it places 95% of the records in the training set. Fixing random_state makes the random selection reproducible, so the same rows are drawn on every run. The next line prints the first five records of the training set.

Figure: Training data

Now, we define the test set.

test = dataset.drop(train.index)
test.head()

In the above code snippet, we drop every row whose index appears in the training set (train) and store the remaining records in test.

Figure: Testing data

Because the rows were sampled at random, the index values in both data frames are no longer consecutive. We can reset them as shown below.

train.reset_index(drop=True,inplace=True)
test.reset_index(drop=True,inplace=True)
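The split steps above can be collected into one function. This is a minimal sketch on a made-up toy frame: split_frame and the column names are illustrative, and only the pandas calls already used in this section appear.

```python
import pandas as pd


def split_frame(df, frac=0.95, seed=786):
    """Randomly split df into train/test parts and reset both indexes."""
    train = df.sample(frac=frac, random_state=seed)  # reproducible random sample
    test = df.drop(train.index)                      # everything not sampled
    return train.reset_index(drop=True), test.reset_index(drop=True)


# Toy stand-in for the mice data: 100 rows, two hypothetical protein columns.
toy = pd.DataFrame({"protein_a": range(100), "protein_b": range(100, 200)})
train_toy, test_toy = split_frame(toy)
print(len(train_toy), len(test_toy))
```

Because test is built by dropping the sampled index values, no row can end up in both parts.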

Setting Up Anomaly Detection

Now we get to the main part. Let us use PyCaret's anomaly module to perform anomaly detection on the mice dataset.

from pycaret.anomaly import *
anomaly_setup = setup(train,normalize = True,session_id = 123)

We import all the functions from the anomaly module of PyCaret. In the next line, we set up the environment and store the result in anomaly_setup. In this setup, we use the training set as the data and normalize the records. The session_id seeds the random number generators, so that results are reproducible from run to run.
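To show what normalizing the records means, here is a small pandas sketch of z-score scaling, which is the default normalization method when normalize=True. It is an illustration on made-up numbers, not PyCaret's internal code, and zscore_normalize is a hypothetical helper.

```python
import pandas as pd


def zscore_normalize(df):
    """Rescale every column to mean 0 and standard deviation 1."""
    # ddof=0 gives the population standard deviation, as scikit-learn's StandardScaler uses.
    return (df - df.mean()) / df.std(ddof=0)


# Two hypothetical protein columns on very different scales.
toy_levels = pd.DataFrame({
    "protein_a": [0.2, 0.4, 0.6, 0.8],
    "protein_b": [10.0, 20.0, 30.0, 40.0],
})
scaled = zscore_normalize(toy_levels)
print(scaled.round(3))
```

After scaling, both columns are on the same scale, so neither dominates the model simply because its raw values are larger.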

Figure: Setup

models()

The models function is used to print all the available models from the anomaly module.

Figure: Available Models

Let us use the Isolation Forest model for our use case.

#iforest
iforest = create_model("iforest")
print(iforest)

We are creating a model for the isolation forest using the create_model and saving it in a variable called iforest. Now when we print the model, we get the information about the model such as the hyperparameters.

Figure: Create Model
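One of the printed hyperparameters is the contamination, which create_model sets from its fraction argument (0.05 by default). The sketch below shows one way such a fraction can turn raw anomaly scores into 0/1 labels. It assumes a percentile-based threshold of the kind used by PyOD-style detectors, and the scores are randomly generated for illustration.

```python
import numpy as np


def label_by_fraction(scores, fraction=0.05):
    """Flag the highest-scoring `fraction` of records as anomalies (1)."""
    scores = np.asarray(scores, dtype=float)
    # Scores above the (1 - fraction) percentile are treated as anomalies.
    threshold = np.percentile(scores, 100 * (1 - fraction))
    labels = (scores > threshold).astype(int)
    return labels, threshold


# Made-up anomaly scores for 200 records.
rng = np.random.default_rng(123)
toy_scores = rng.normal(size=200)
toy_labels, toy_threshold = label_by_fraction(toy_scores)
print(toy_labels.sum(), "of", len(toy_labels), "records flagged")
```

With this kind of rule, the fraction fixes in advance how many training records are flagged, so it should reflect how common anomalies are expected to be.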

Evaluating the Model and Making Predictions

If we use the evaluate_model function on the model we just created, this is the output.

Figure: Evaluate Model

It is time to get the results!

result = assign_model(iforest)
result.head()

The result dataframe has two new columns appended: Anomaly, the anomaly class (1 for an anomaly, 0 otherwise), and Anomaly_Score, the anomaly score.

Figure: Anomaly Detection
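Once the labels are attached, ordinary pandas operations are enough to pull out the flagged records. The sketch below uses a small hand-made frame in place of the real output: the Anomaly and Anomaly_Score column names match what assign_model appends, while top_anomalies and the values are made up for illustration.

```python
import pandas as pd


def top_anomalies(result, n=5):
    """Return the n flagged records with the highest anomaly scores."""
    flagged = result[result["Anomaly"] == 1]
    return flagged.sort_values("Anomaly_Score", ascending=False).head(n)


# Hand-made stand-in for the frame returned by assign_model.
mock_result = pd.DataFrame({
    "protein_a": [0.4, 0.5, 2.9, 0.6, 3.4],
    "Anomaly": [0, 0, 1, 0, 1],
    "Anomaly_Score": [-0.08, -0.05, 0.03, -0.06, 0.07],
})
top = top_anomalies(mock_result)
print(top)
```

The same call on the real frame, top_anomalies(result), would list the training records that the Isolation Forest considers most unusual.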

We can use the plot_model function to visualize the anomalies in the train data.

# 3D visualization
plot_model(iforest)

Figure: 3D interactive Plot

By default, plot_model draws an interactive 3D t-SNE projection of the data. The yellow dots in the plot are the anomalies in the training data. We can also plot a 2D UMAP visualization as follows.

plot_model(iforest, plot="umap")

Figure: 2D Plot

Let’s make some predictions using the model on the test dataset.

predictions = predict_model(iforest, test)
predictions.head()

Figure: Predictions

The model has predicted that the last record has a high anomaly score and is classified as an anomaly.

Saving the Model

Finally, we can save the model.

save_model(iforest, "iforestmodel")

Figure: Saved Model
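A saved model is only useful if it can be loaded again. The following sketch pairs save_model with load_model and predict_model from the same module. score_with_saved_model is a hypothetical helper, and it assumes PyCaret is installed and that the pickle file sits in the working directory.

```python
def score_with_saved_model(model_name, new_data):
    """Load a model saved with save_model and score new records with it."""
    # Imported lazily so this helper can be defined even where PyCaret is not installed.
    from pycaret.anomaly import load_model, predict_model

    # load_model expects the name without the ".pkl" extension.
    pipeline = load_model(model_name)

    # Returns new_data with the anomaly label and score columns attached.
    return predict_model(pipeline, data=new_data)
```

For the model saved above, the call would be score_with_saved_model("iforestmodel", test).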

Summary

That is it for this tutorial. We defined anomaly detection, went through the PyCaret functions used to perform it, and worked through an example on the mice dataset.

Functions available in the classification and regression modules may not apply to an anomaly detection problem. For example, the compare_models function, which we used to pick the best model for a regression problem in the previous tutorial, is not available in the anomaly detection module. Hence, it is important to go through the documentation for each module.

References

PyCaret: Anomaly Detection