I have joined a lot of Kaggle competitions in the past, and for the past 3-4 years, most of the top winning submissions I have seen use some form of gradient boosting. Therefore, we will look at it closely today.
What is Gradient Boosting?
Ensemble Learning: Ensemble approaches combine multiple learning algorithms to obtain better predictive performance than any of the constituent learning algorithms could achieve alone.
An individual model very often suffers from high bias or high variance, and combining several models is a way to reduce both. That is why ensemble learning is worth learning.
Bagging and boosting are the two most common ensemble techniques.
- Bagging: Many models are trained in parallel. Each model is trained on a random subset of the data, and their predictions are combined.
- Boosting: Many models are trained sequentially. Each new model learns from the errors made by the previous models.
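To make the contrast concrete, here is a minimal sketch using scikit-learn on synthetic data. The dataset, the hyperparameters, and the variable names are assumptions chosen for illustration; RandomForestRegressor stands in for bagging and GradientBoostingRegressor for boosting.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic 1-D regression problem (illustrative only): y = sin(x) + noise
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: the trees do not depend on each other, so they can be built in parallel
bagged = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=0)
bagged.fit(X_train, y_train)

# Boosting: the trees are built one after another, each correcting the current ensemble
boosted = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=0)
boosted.fit(X_train, y_train)

bagging_r2 = bagged.score(X_test, y_test)
boosting_r2 = boosted.score(X_test, y_test)
print(f"bagging R^2:  {bagging_r2:.3f}")
print(f"boosting R^2: {boosting_r2:.3f}")
```

On a toy problem like this both approaches do well. The point is only the difference in how the trees are built, not which score is higher.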
While you have already learned bagging techniques previously (like Random Forest), let’s look at what boosting is.
Gradient boosting classifiers are a category of machine learning algorithms that merge several weak learning models together to produce a strong predictive model.
When doing gradient boosting, decision trees are typically used as the weak learners. Because of their effectiveness on complex datasets, gradient boosting models have become common, and have recently been used to win several Kaggle data science competitions!
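The idea of building a strong model from weak ones can be sketched in plain Python. The example below is a simplified illustration, not how any production library does it: it assumes a single numeric feature, squared-error loss, and one-split trees ("stumps") as weak learners. All function names are made up for this sketch. Each round fits a stump to the current residuals (the negative gradient of the squared error) and adds a scaled-down copy of it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_gradient_boosting(xs, ys, n_rounds=100, learning_rate=0.1):
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (illustrative only): y = x^2 on a small grid
xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]

model = fit_gradient_boosting(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print("MSE of the mean prediction:", baseline_mse)
print("MSE after boosting:        ", boosted_mse)
```

A single stump is a very weak model, yet the sum of many small corrections follows the curve closely on the training data.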
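Scikit-Learn, the Python machine learning library, ships its own gradient boosting estimators (such as GradientBoostingClassifier and GradientBoostingRegressor). The popular implementations XGBoost, LightGBM, and CatBoost are separate libraries, but they all provide a Scikit-Learn-compatible interface.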
What is XGBoost?
XGBoost is the leading model for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). Many Kaggle competitions are dominated by XGBoost models.
Compared with strategies such as Random Forest, XGBoost models require more expertise and model tuning to achieve optimum accuracy.
But getting a first model running is super easy.
Implementation of Gradient Boosting on House Prices Dataset
I am using a very popular dataset from Kaggle.com called the House Price Prediction (HPP) Dataset.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Let’s get started!
1. Import Required Packages
Let’s import our important packages:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from xgboost import XGBRegressor
The imputer is used to “impute” (replace) NaN values in a dataset with the column mean, the median, the most frequent value, or a constant of your choice.
2. Setting up the data
Let’s load our training data:
data_train = pd.read_csv('train.csv')
data_train.dropna(axis=0, subset=['SalePrice'], inplace=True)
data_train.head(1)
We drop the rows that have NaN in SalePrice, because SalePrice is the value we want to predict and a row without it cannot be used for training.
We’ll use SalePrice as the labels y and the remaining numeric columns as the features X (the select_dtypes call drops the text columns):
y = data_train.SalePrice
X = data_train.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
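Here is a small sketch of what this step does, on a made-up three-row DataFrame (the values are invented for illustration). It selects the numeric columns with include=["number"], an alternative to exclude=['object'] that also leaves out text columns.

```python
import pandas as pd

# Tiny made-up frame with one numeric feature, one text feature, and the target
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Pave", "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

toy_y = toy["SalePrice"]                                   # labels
toy_X = toy.drop(columns=["SalePrice"]).select_dtypes(include=["number"])  # numeric features only

print(list(toy_X.columns))
```

The text column is dropped here rather than encoded, which keeps the tutorial short but throws away information that a tuned model could use.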
We divide the data into train and test data in a 3:1 ratio, using sklearn’s train_test_split function:
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)
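One thing to keep in mind is that train_test_split shuffles the rows randomly, so the split (and therefore the final error) changes from run to run. A minimal sketch on made-up arrays, assuming you want a reproducible split: passing random_state fixes the shuffle. The value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features
demo_X = np.arange(200).reshape(100, 2)
demo_y = np.arange(100)

# test_size=0.25 gives the 3:1 ratio; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(demo_X, demo_y, test_size=0.25, random_state=42)
X_tr_again, _, _, _ = train_test_split(demo_X, demo_y, test_size=0.25, random_state=42)

print(X_tr.shape, X_te.shape)
print("same split both times:", np.array_equal(X_tr, X_tr_again))
```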
Let’s impute NaN values in the dataset:
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
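Note that the imputer is fitted on the training data only and then reused on the test data. The sketch below, on tiny made-up arrays, shows what that means in practice: the NaN in the test row is filled with the column mean learned from the training rows, not with anything computed from the test set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

toy_train = np.array([[1.0, 2.0],
                      [np.nan, 4.0],
                      [3.0, np.nan]])
toy_test = np.array([[np.nan, 10.0]])

toy_imputer = SimpleImputer(strategy="mean")          # "mean" is also the default
toy_train_filled = toy_imputer.fit_transform(toy_train)  # learns the column means 2.0 and 3.0
toy_test_filled = toy_imputer.transform(toy_test)        # reuses the training means

print(toy_train_filled)
print(toy_test_filled)
```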
And we’re done with the preprocessing for now. We could obviously tune each column of the dataset, find outliers, regularize, etc. but that is your homework!
3. Creating the Model
Let’s create our model:
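my_model = XGBRegressor()
my_model.fit(train_X, train_y, verbose=True)
As you can see in your output, these are all the parameters that we can specify to tune our model (the exact list and the default values depend on your XGBoost version; for example, newer versions name the default objective 'reg:squarederror' instead of 'reg:linear'):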
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, importance_type='gain', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
We can make our predictions now:
predictions = my_model.predict(test_X)
predictions
and that gives us an array of predicted sale prices, one for each row of test_X.
We can also compute our regression error, which comes out to be ~17000 for us:
from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(test_y, predictions)))
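The mean absolute error is simply the average of the absolute differences between the actual and the predicted prices. A hand-rolled version on three made-up prices shows the arithmetic (the function name and the numbers are invented for this sketch):

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


actual_prices = [200000, 150000, 310000]
predicted_prices = [210000, 140000, 300000]

toy_mae = mean_absolute_error_by_hand(actual_prices, predicted_prices)
print(toy_mae)  # each prediction is off by 10000, so the MAE is 10000.0
```

So an error of ~17000 means that the predictions are off by about 17,000 on average, in the same unit as SalePrice.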
Complete Code Implementation for Gradient Boosting
If you missed out on any step, you’ll find the full code here along with the dataset:
Other forms – LightGBM and CatBoost
The usage is almost exactly the same as XGBoost:
from lightgbm import LGBMRegressor
my_model = LGBMRegressor()
my_model.fit(train_X, train_y)  # recent LightGBM versions no longer accept verbose in fit()
from catboost import CatBoostRegressor
my_model = CatBoostRegressor()
my_model.fit(train_X, train_y, verbose=True)
The rest of the process (prediction and evaluation) is the same.
LightGBM: LightGBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks.
It grows trees leaf-wise, always splitting the leaf that gives the best improvement, while most other boosting algorithms grow trees depth-wise or level-wise.
When growing the same number of leaves, the leaf-wise algorithm can therefore reduce more loss than the level-wise algorithm, which often results in better accuracy, although it can also overfit on small datasets.
It is also very fast. Training typically takes significantly less time with LightGBM, so nowadays it is often preferred as a “quick fix”.
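The difference between leaf-wise and level-wise growth can be illustrated with a toy sketch in plain Python. This is not LightGBM's actual implementation (which works on histograms and many features). It assumes one numeric feature and squared error, and all names are made up. Both strategies get a budget of four leaves: level-wise splits every leaf down to depth 2, while leaf-wise always splits the leaf with the largest gain.

```python
def sse(points):
    """Sum of squared errors around the mean target of a leaf."""
    if not points:
        return 0.0
    mean = sum(y for _, y in points) / len(points)
    return sum((y - mean) ** 2 for _, y in points)


def best_split(points):
    """Return (gain, left, right) for the best threshold split, or None."""
    pts = sorted(points)
    best = None
    for i in range(1, len(pts)):
        if pts[i - 1][0] == pts[i][0]:
            continue  # cannot split between identical feature values
        left, right = pts[:i], pts[i:]
        gain = sse(pts) - sse(left) - sse(right)
        if best is None or gain > best[0]:
            best = (gain, left, right)
    return best


def grow_leaf_wise(points, max_leaves):
    """Best-first growth: always split the leaf with the largest gain."""
    leaves = [points]
    while len(leaves) < max_leaves:
        candidates = [(best_split(leaf), i) for i, leaf in enumerate(leaves)]
        candidates = [(s, i) for s, i in candidates if s is not None]
        if not candidates:
            break
        (_, left, right), i = max(candidates, key=lambda c: c[0][0])
        leaves[i:i + 1] = [left, right]
    return leaves


def grow_level_wise(points, max_depth):
    """Level-wise growth: split every leaf of the current level."""
    leaves = [points]
    for _ in range(max_depth):
        next_leaves = []
        for leaf in leaves:
            split = best_split(leaf)
            next_leaves.extend(split[1:] if split else [leaf])
        leaves = next_leaves
    return leaves


# Toy data (illustrative only): flat on the left, changing quickly on the right
data = list(zip(range(16), [0] * 8 + [1, 1, 4, 4, 9, 9, 16, 16]))

leaf_wise_leaves = grow_leaf_wise(data, max_leaves=4)
level_wise_leaves = grow_level_wise(data, max_depth=2)
leaf_wise_sse = sum(sse(leaf) for leaf in leaf_wise_leaves)
level_wise_sse = sum(sse(leaf) for leaf in level_wise_leaves)
print("leaf-wise SSE: ", leaf_wise_sse)
print("level-wise SSE:", level_wise_sse)
```

With four leaves, the leaf-wise tree never ends up with a higher training error than the level-wise one, because it spends its splits where the error is largest instead of splitting flat regions.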
CatBoost: CatBoost implements ordered boosting, but its biggest advancement is how it deals with categorical information. Categorical data introduces many problems, because it has to be given a numerical encoding before a model can use it.
CatBoost uses a variant of target encoding: it takes a random permutation of the rows and encodes the category of each row using only the target values of the rows that come before it (the available history). It does not use the mean over the whole dataset, since a model running in real time does not know the true mean for its target.
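The following is a simplified sketch of this kind of ordered target encoding in plain Python. It is not CatBoost's actual implementation; the function name, the prior, and the prior_weight parameter are assumptions made for illustration. Each row is encoded from the targets of earlier rows with the same category, smoothed towards a prior so that the first occurrence of a category still gets a value.

```python
import random


def ordered_target_encode(categories, targets, prior, prior_weight=1.0, order=None, seed=0):
    """Encode each row using only the rows that precede it in `order`."""
    n = len(categories)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)  # random permutation of the rows
    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in order:
        cat = categories[i]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded[i] = (seen_sum + prior * prior_weight) / (seen_count + prior_weight)
        # Only now does the current row become part of the history
        sums[cat] = seen_sum + targets[i]
        counts[cat] = seen_count + 1
    return encoded


colors = ["red", "blue", "red", "red", "blue"]
clicked = [1, 0, 1, 0, 1]

# A fixed order is used here so the result is easy to follow by hand
encoded_colors = ordered_target_encode(colors, clicked, prior=0.5, order=[0, 1, 2, 3, 4])
print(encoded_colors)
```

Because a row never sees its own target, the encoding does not leak the label into the feature, which is the problem with a plain target mean.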
There are several benchmark tests that people have done for all the above algorithms. Go through them:
However, the overall narrative is that CatBoost is slow and not very effective, although results vary with the dataset and the settings. Try doing your own benchmark tests, and let us know in the comments which you prefer.
Gradient boosting is a powerful technique for classification and regression, and understanding it will make it easier to learn new machine learning algorithms.