Python catboost module: A Brief Introduction to CatBoost Classifier

Hello learner! In this tutorial, we will learn about the catboost module and a slightly more advanced concept, the CatBoostClassifier. So let’s begin!

What is the catboost module?

CatBoost is an open-source library that provides a fast, scalable, high-performance implementation of gradient boosting on decision trees, used for classification, regression, ranking, and other machine learning tasks. It also offers GPU support to speed up training.

CatBoost can be used for a range of regression and classification problems, including many of those available on Kaggle.

Implementing the Catboost Classifier

1. Importing Modules

For this simple implementation, we will import three modules: catboost for the classifier, matplotlib for data visualization, and numpy to generate the datasets.

If any of the imports raises an error, install the missing module using the pip command. The code to import the required modules and class is shown below.

from catboost import CatBoostClassifier
import matplotlib.pyplot as plt
import numpy as np
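To find out which of these modules are missing without triggering an ImportError, a small check like the following can help. This is only a sketch: the names `required` and `missing` are illustrative and not part of the tutorial's code.

```python
import importlib.util

# Top-level module names used in this tutorial.
required = ["catboost", "matplotlib", "numpy"]

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print("Missing modules, install with: pip install " + " ".join(missing))
else:
    print("All required modules are available.")
```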

2. Training and Testing Data Preparation

The next step is to create training data for the CatBoost model and then testing data made up of random points to check its predictions.

Training Data

To create sample training data we need two things for each class: a mean vector and a covariance matrix. The mean describes the center of the points and the covariance describes their spread.

We then sample from a multivariate normal distribution, passing the mean, the covariance matrix, and the number of points.

The code to create data for two different classes is shown below.

mean1=[8,8]
covar1=[[2,0.7],[0.7,1]]
d1=np.random.multivariate_normal(mean1,covar1,200)

mean2=[1,1]
covar2=[[2,0.7],[0.7,1]]
d2=np.random.multivariate_normal(mean2,covar2,200)
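As a quick sanity check on the roles of the mean and the covariance, the sketch below draws a larger sample and compares its empirical mean and covariance with the parameters that were passed in. The fixed seed, the sample size of 5000, and the names `rng`, `sample`, `sample_mean` and `sample_cov` are assumptions made for illustration only.

```python
import numpy as np

# Assumption: a fixed seed so the check is repeatable.
rng = np.random.default_rng(0)

mean = [8, 8]
covar = [[2, 0.7], [0.7, 1]]

# Draw more points than the tutorial does so the estimates are stable.
sample = rng.multivariate_normal(mean, covar, 5000)

sample_mean = sample.mean(axis=0)          # estimated center of the cloud
sample_cov = np.cov(sample, rowvar=False)  # estimated spread of the cloud

print(sample_mean)
print(sample_cov)
```

Both printed estimates should land close to the chosen mean and covariance, which is why the two classes form separate clouds around (8, 8) and (1, 1).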

Testing Data

To get testing points, we import the random module and generate 5 random x and y coordinates to pass to the trained model later on. The next step is to put the x and y coordinates together in a list using a for loop.

The code for the same is shown below.

import random
x_cord_test = [random.randint(-2,10) for i in range(5)]
y_cord_test = [random.randint(-2,10) for i in range(5)]
test_data = []
for i in range(len(x_cord_test)):
    test_data.append([x_cord_test[i],y_cord_test[i]])
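The same list of test points can be built a little more compactly with zip. The version below is a sketch of that alternative; the seed value and the name `n_points` are assumptions added so that repeated runs produce the same points.

```python
import random

random.seed(42)  # assumption: seed chosen only for repeatability

n_points = 5
# randint includes both endpoints, so coordinates fall in [-2, 10].
x_cord_test = [random.randint(-2, 10) for _ in range(n_points)]
y_cord_test = [random.randint(-2, 10) for _ in range(n_points)]

# zip pairs the i-th x with the i-th y, replacing the index-based loop.
test_data = [[x, y] for x, y in zip(x_cord_test, y_cord_test)]
print(test_data)
```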

Data Visualization – 1

We will visualize the data using the matplotlib library, plotting the training data along with the testing points.

The code for the same is shown below.

plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
plt.scatter(d1[:,0],d1[:,1],color="Red",s=20)
plt.scatter(d2[:,0],d2[:,1],color="Blue",s=20)
for i in test_data:
    plt.scatter(i[0],i[1],marker="*",s=200,color="black")
plt.show()

The resulting graph is shown below.

Final training data for the model preparation

The final step is to create the final training data by combining the data for the two classes into a single NumPy array (named df in the code).

The number of rows in the resulting data equals the total number of data points in both classes. There are 3 columns, which store the x coordinate, the y coordinate, and the label of each point.

We create a placeholder array with all values set to 0. Then we put the data for the two classes, along with their labels, into the correct positions in the array. The last step is to shuffle the rows.

df_rows=d1.shape[0]+d2.shape[0]
df_columns=d1.shape[1]+1

df=np.zeros((df_rows,df_columns))

df[0:d1.shape[0],0:2]=d1
df[d1.shape[0]:,0:2]=d2
df[0:d1.shape[0],2]=0
df[d1.shape[0]:,2]=1

np.random.shuffle(df)
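An equivalent way to assemble the labeled array is to attach the label column to each class first and then stack the two blocks. The sketch below regenerates the two classes so that it runs on its own; the seed and the names `rng`, `block1`, `block2`, `X` and `y` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability
d1 = rng.multivariate_normal([8, 8], [[2, 0.7], [0.7, 1]], 200)
d2 = rng.multivariate_normal([1, 1], [[2, 0.7], [0.7, 1]], 200)

# Append label 0 to the first class and label 1 to the second class.
block1 = np.column_stack([d1, np.zeros(len(d1))])
block2 = np.column_stack([d2, np.ones(len(d2))])
df = np.vstack([block1, block2])

# Shuffles the rows in place; each row keeps its own label.
rng.shuffle(df)

X, y = df[:, :2], df[:, 2]  # features and labels, as passed to fit later
print(df.shape)  # (400, 3)
```

Because whole rows are shuffled, the coordinates and their label always stay together.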

Data Visualization – 2

Now let’s visualize our final data using the code below.

plt.scatter(df[:,0],df[:,1],color="Green")
for i in test_data:
    plt.scatter(i[0],i[1],marker="*",s=200,color="black")
plt.show()

The final graph is shown below. Now the data is ready to go into the CatBoostClassifier.

3. Using the catboost module – CatBoostClassifier

To implement the CatBoostClassifier, we create a model object that takes the number of iterations as a parameter. We will also train on a GPU, so we pass task_type as a parameter (if no CUDA-capable GPU is available, task_type can be set to "CPU" instead).

The next step is to fit the training data points and labels to train the model using the fit function. We then pass the testing points to the predict function and get the results.

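model = CatBoostClassifier(iterations=100,task_type="GPU")
model.fit(df[:,0:2],df[:,2],verbose=False)
for point,label in zip(test_data,model.predict(test_data)):
    print(f"({point[0]},{point[1]}) ==> {label}")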

The results are as follows. You can cross-check against the graph that the results are pretty accurate.

(6,3) ==> 0.0
(10,4) ==> 0.0
(6,-2) ==> 0.0
(1,7) ==> 1.0
(3,0) ==> 1.0

Conclusion

Congratulations! Today you successfully learned about a fast and amazing Classifier known as CatBoost. You can try out the same on various datasets of your own! Happy Coding!

Thank you for reading!