Hello learner! In this tutorial, we will learn about the catboost module and a slightly more advanced concept known as the CatBoostClassifier. So let’s begin!
What is the catboost module?
The CatBoost module is an open-source library for fast, scalable, high-performance gradient boosting on decision trees, applicable to many other machine learning tasks as well. It also offers GPU support to speed up training.
CatBoost can be used for a range of regression and classification problems, many of which are available on Kaggle as well.
Implementing the Catboost Classifier
1. Importing Modules
For this simple implementation of the catboost module, we will import three modules: the catboost module itself, matplotlib for data visualization, and numpy to generate the dataset.
If any of the module imports raises an error, make sure you install the module using the pip command. The code to import the right modules and functions is shown below.
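If you need to install the packages first, they are all available on PyPI; assuming a standard Python environment with pip on the path, the install command would look something like this:

```shell
# install the three packages used in this tutorial (PyPI package names)
pip install catboost matplotlib numpy
```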
from catboost import CatBoostClassifier
import matplotlib.pyplot as plt
import numpy as np
2. Training and Testing Data Preparation
The next step is to create training data for the catboost model, and then testing data to check the model's predictions on random points.
To create the sample training data for each class we need a mean vector and a covariance matrix, where the mean describes the center of the points and the covariance describes their spread.
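As a quick illustration of how the mean and covariance control the generated points, the sketch below draws 200 samples and checks that their sample mean lands near the requested center (the seeded generator is an assumption added for reproducibility, not part of the original code):

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator for reproducibility
mean = [8, 8]                           # center of the point cloud
covar = [[2, 0.7], [0.7, 1]]            # spread and correlation of the cloud
points = rng.multivariate_normal(mean, covar, 200)

print(points.shape)                     # (200, 2): 200 points with x and y
print(points.mean(axis=0))              # sample mean, close to [8, 8]
```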
The code to create data for two different classes is shown below.
mean1=[8,8]
covar1=[[2,0.7],[0.7,1]]
d1=np.random.multivariate_normal(mean1,covar1,200)
mean2=[1,1]
covar2=[[2,0.7],[0.7,1]]
d2=np.random.multivariate_normal(mean2,covar2,200)
To get testing points, we will import the random module and generate five random x and y coordinates to pass to the trained model later on. The next step is to put the x and y coordinates together in a list using a for loop.
The code for the same is shown below.
import random

x_cord_test = [random.randint(-2,10) for i in range(5)]
y_cord_test = [random.randint(-2,10) for i in range(5)]
test_data = []
for i in range(len(x_cord_test)):
    test_data.append([x_cord_test[i],y_cord_test[i]])
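The same pairing of coordinates can be written more compactly with zip; a small sketch (the names x_cord_test and y_cord_test mirror the code above, and the seed is an added assumption for reproducibility):

```python
import random

random.seed(1)  # seeded so the run is reproducible (not in the original)
x_cord_test = [random.randint(-2, 10) for i in range(5)]
y_cord_test = [random.randint(-2, 10) for i in range(5)]

# zip pairs each x with its matching y; each pair becomes one [x, y] point
test_data = [list(pair) for pair in zip(x_cord_test, y_cord_test)]
print(len(test_data))   # 5 test points
```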
Data Visualization – 1
We will visualize the data using the matplotlib library, plotting the training data along with the testing points.
The code for the same is shown below.
plt.style.use('seaborn')
plt.scatter(d1[:,0],d1[:,1],color="Red",s=20)
plt.scatter(d2[:,0],d2[:,1],color="Blue",s=20)
for i in test_data:
    plt.scatter(i[0],i[1],marker="*",s=200,color="black")
plt.show()
The resulting graph is shown below.
Final Training Data for Model Preparation
The final step would be to create the final training data by combining the data for two classes together into a single data frame.
The number of rows in the resulting data will equal the sum of the number of data points in the two classes. The number of columns will equal 3: the x coordinate, the y coordinate, and the label of each point.
We create a dummy array with all values as 0, then put the data for the two classes along with their labels into the correct positions in the array. The last step involves shuffling the data.
df_rows=d1.shape[0]+d2.shape[0]
df_columns=d1.shape[1]+1
df=np.zeros((df_rows,df_columns))
df[0:d1.shape[0],0:2]=d1
df[d1.shape[0]:,0:2]=d2
df[0:d1.shape[0],2]=0
df[d1.shape[0]:,2]=1
np.random.shuffle(df)
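Equivalently, the same labeled array can be built with np.column_stack and np.vstack, which avoids indexing into a preallocated array of zeros. A sketch, with two seeded Gaussian clouds standing in for the d1 and d2 generated above:

```python
import numpy as np

# two Gaussian clouds standing in for d1 and d2 from the tutorial
rng = np.random.default_rng(0)
d1 = rng.multivariate_normal([8, 8], [[2, 0.7], [0.7, 1]], 200)
d2 = rng.multivariate_normal([1, 1], [[2, 0.7], [0.7, 1]], 200)

# attach label 0 to every d1 row and label 1 to every d2 row, then stack
labeled1 = np.column_stack([d1, np.zeros(len(d1))])
labeled2 = np.column_stack([d2, np.ones(len(d2))])
df = np.vstack([labeled1, labeled2])

rng.shuffle(df)          # shuffle the rows in place
print(df.shape)          # (400, 3)
```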
Data Visualization – 2
Now let’s visualize our final data using the code below.
plt.scatter(df[:,0],df[:,1],color="Green")
for i in test_data:
    plt.scatter(i[0],i[1],marker="*",s=200,color="black")
plt.show()
The final graph is shown below. Now the data is ready to be fed into the model.
3. Using the catboost module – CatBoostClassifier
To implement the CatBoostClassifier, we create a model object which takes the number of iterations as a parameter. We will also use the GPU for the model, so we pass task_type as a parameter.
The next step is fitting the training data points and labels to train the model using the
fit function. We will also pass each testing point into the
predict function and get the results.
model = CatBoostClassifier(iterations=100,task_type="GPU")
model.fit(df[:,0:2],df[:,2],verbose=False)
for i in test_data:
    print(tuple(i),"==>",model.predict(i))
The results are as follows. You can cross check from the graph that the results are pretty accurate.
(6,3) ==> 0.0
(10,4) ==> 0.0
(6,-2) ==> 0.0
(1,7) ==> 1.0
(3,0) ==> 1.0
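Since the two classes were generated around (8, 8) (label 0) and (1, 1) (label 1), you can sanity-check predictions like these with a simple nearest-center rule. This is a rough check independent of CatBoost, not the model itself:

```python
import numpy as np

# the class centers used when generating the training data
centers = {0.0: np.array([8, 8]), 1.0: np.array([1, 1])}

def nearest_center(point):
    """Label a point by whichever class center is closer (Euclidean distance)."""
    dists = {label: np.linalg.norm(np.array(point) - c)
             for label, c in centers.items()}
    return min(dists, key=dists.get)

print(nearest_center([10, 4]))   # 0.0: much closer to (8, 8)
print(nearest_center([1, 7]))    # 1.0: closer to (1, 1)
```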
Congratulations! Today you successfully learned about a fast and amazing Classifier known as CatBoost. You can try out the same on various datasets of your own! Happy Coding!
Thank you for reading!