Ensemble Learning – A Complete Beginner’s Guide

Machine Learning is the technological trend currently sweeping research and industry publications. It is a rapidly growing, always in-demand field that is driving advances in crucial industries such as health care, retail, and automobiles, so mastering it is undoubtedly an asset to our careers. There are many algorithms and techniques used in machine learning, each with its own use case.

Learn more about Machine Learning here.

We are going to learn about one such technique of machine learning – Ensemble Learning.

In English, an ensemble generally means a group of performers (musicians, actors, or dancers) who perform or work together.

Ensemble learning is based on the same concept: it combines the predictions of multiple machine learning models to arrive at a more accurate final model.

What Is Ensemble Learning?

To explain the idea in simple terms, ensemble learning is a machine learning technique that combines the results of several models to produce a better final prediction than any single model would. Ensemble learning is a supervised learning approach and can be used to solve both classification and regression tasks, although it is most commonly applied to classification problems.
We can say that ensemble learning is supervised because we train the models on labeled data with respect to a target or output. Later, the combined model is used to make predictions on new data.

Start with supervised machine learning from here.

By doing this, we can improve the overall performance by combining the results of multiple models.

Let us try to understand the concept with the help of a flow chart.

Ensemble Learning

There are a few techniques used in ensemble learning to combine the results of various models. Let us try to understand each of them one by one.

Techniques of Ensemble Learning

The individual models we use in ensemble learning are often called base learners or weak learners. Generally, models with accuracies only slightly better than 50% are considered weak learners: they have learned something about the data, but on their own they cannot provide a solution we can completely rely on.

Such weak learners are combined in ensemble learning to produce a strong learner (or strong classifier) whose accuracy is much better, which also helps us manage the bias-variance tradeoff of the model.

Learn about the bias-variance tradeoff here.

The techniques of Ensemble Learning are given below.

  • Averaging
  • Voting
  • Bagging
  • Boosting

Let us learn about these techniques one by one.

Averaging

You must have guessed it by now: averaging is the technique in which the predictions (or accuracies) of the weak learners are averaged, and that mean becomes the final prediction of the strong learner.

Averaging

Let us take an example. For a regression problem, we used three models: Support Vector Machine, Linear Regression, and Lasso Regression, with accuracies of 55%, 89%, and 72%, respectively. To determine the final result, we calculate the average of the accuracies of these three models, which comes to 72%.
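To make this concrete, here is a minimal sketch in scikit-learn using the same three kinds of regressors; the toy dataset generated with make_regression and the parameter values are only illustrative, not taken from a real problem. The VotingRegressor class averages the predictions of its base models, which is exactly the averaging idea described above.

from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR

# Toy regression data standing in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

# Three weak learners whose predictions will be averaged
models = [("svr", SVR()), ("linear", LinearRegression()), ("lasso", Lasso())]

# VotingRegressor averages the predictions of the fitted base models
ensemble = VotingRegressor(estimators=models)
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))

Averaging the raw predictions this way usually gives a more stable result than relying on any single one of the three regressors.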

Voting

In elections, we cast our votes for the party we want to win; all the votes are counted, and the party with the highest number of votes is declared the winner. A similar concept is used here: each model casts a "vote" with its prediction, and the prediction that receives the most votes becomes the final output of the ensemble.

There are two types of voting used in ensemble learning.

Hard Voting: Hard voting is the simplest form of voting. For a given input, each weak learner predicts a class, and the class with the majority of votes is chosen as the final prediction. Hard voting is also called majority voting.

For example, suppose three models have to classify an image as a cat or a dog. The predictions of the models are ['Cat', 'Dog', 'Cat']. Since the class 'Cat' has two votes, the image is classified as a cat.

Hard Voting

Soft Voting: Soft voting is a slightly different concept. When the classifiers provide confidence scores or probabilities for each class, soft voting averages the probabilities assigned to each class across all classifiers and outputs the class with the highest average probability.

For the same example as above, let us say Model 1 has assigned a probability of 0.8 to the class ‘Cat’ and 0.2 to the class ‘Dog’.
Model 2 assigns the probability of 0.6 to ‘Cat’ and 0.4 to ‘Dog’.
The last model assigns a probability of 0.3 to ‘Cat’ and 0.7 to ‘Dog’.

The soft voting classifier calculates the average of each class and predicts the class with the highest average as the final prediction.

The calculation is as follows.

For the class Cat:
(0.8 + 0.6 + 0.3) / 3 = 1.7 / 3 = 0.5667

For the class Dog:
(0.2 + 0.4 + 0.7) / 3 = 1.3 / 3 = 0.4333

Since 0.5667 is greater than 0.4333, the soft voting classifier predicts 'Cat'.
Soft Voting
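Both forms of voting are available in scikit-learn through the VotingClassifier class and its voting parameter. Here is a minimal sketch using a toy dataset from make_classification; the particular base models and parameter values are only placeholders, not taken from the cat/dog example above.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("nb", GaussianNB()),
              ("dt", DecisionTreeClassifier(random_state=42))]

# Hard voting: each model casts one vote and the majority class wins
hard_clf = VotingClassifier(estimators=estimators, voting="hard").fit(X, y)

# Soft voting: the predicted class probabilities are averaged
soft_clf = VotingClassifier(estimators=estimators, voting="soft").fit(X, y)

print(hard_clf.predict(X[:5]))
print(soft_clf.predict(X[:5]))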

Bagging

Bagging in ensemble learning creates multiple subsets of the given dataset by sampling with replacement, which means the same data point can be selected more than once. The process of creating these subsets from the original dataset is called bootstrapping, and the process of combining the results of the models trained on them is called aggregation; hence, bagging is often called bootstrap aggregation. One important point to note is that in bagging, the models working on the subsets are trained in parallel.

Bagging is mainly used to decrease the variance of the models.

Bagging
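As a rough sketch, bagging is available in scikit-learn through BaggingRegressor (and BaggingClassifier for classification). By default it uses decision trees as the base models; the toy dataset and parameter values below are only illustrative.

from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

# Toy data; each base model is trained on a bootstrap sample of it
X, y = make_regression(n_samples=300, n_features=6, noise=5, random_state=0)

bagging = BaggingRegressor(n_estimators=10,    # number of base models (decision trees by default)
                           bootstrap=True,     # sample the training data with replacement
                           n_jobs=-1,          # train the base models in parallel
                           random_state=0)
bagging.fit(X, y)
print(bagging.predict(X[:3]))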

Boosting

Unlike bagging, the boosting technique follows a sequential approach. A first model is trained on the dataset and its performance is measured. A second model is then trained with a focus on the errors made by the first, a third on the errors that remain, and so on. This process is repeated until the error stops decreasing or a fixed number of models is reached. Because the models are trained one after another, the process is called sequential; it is also iterative, since it keeps working until the error is minimized.

There are various types of boosting algorithms, such as AdaBoost, Gradient Boosting, and XGBoost.

Boosting

From the flowchart above, we can see that the errors of the previous model or classifier are supplied as input for the next classifier to work on. This is done iteratively until the error is reduced.
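As a minimal sketch, one common boosting implementation in scikit-learn is GradientBoostingRegressor, which fits each new tree to the errors left over by the trees before it; the toy dataset and parameter values here are only illustrative.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Toy data; trees are added one after another, each correcting the previous errors
X, y = make_regression(n_samples=300, n_features=6, noise=5, random_state=0)

boosting = GradientBoostingRegressor(n_estimators=100,   # how many trees to add sequentially
                                     learning_rate=0.1,  # how strongly each tree corrects the errors
                                     random_state=0)
boosting.fit(X, y)
print(boosting.predict(X[:3]))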

Learn more about AdaBoost here.

These are the frequently followed ensemble techniques in machine learning.

Example of Ensemble Learning – Random Forest

Random Forest is a classic example of ensemble learning that can be used for both regression and classification tasks. The reason behind its name is interesting. The algorithm is built from decision trees; at each split of a tree, only a random subset of the features is considered, which helps avoid overfitting and keeps the individual trees from becoming too similar to one another. Because the models used in this algorithm are trees, and many of them are grown together, the algorithm is called a random forest.

Regression vs Classification

Random Forest is based on bagging and uses the concept of voting for classification and averaging for solving regression tasks.
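Before we move on to regression, here is a minimal classification sketch with a toy dataset (the values are only illustrative); the max_features parameter is what makes the forest consider a random subset of features at each split.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; each tree is grown on a bootstrap sample of the rows
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=50,      # number of trees in the forest
                             max_features="sqrt",  # random subset of features considered at each split
                             random_state=0)
clf.fit(X, y)

# Each tree votes for a class; the majority vote is the final prediction
print(clf.predict(X[:5]))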

Let us see the implementation of random forest for regression.

Random Forest for Regression

The scikit-learn library has a dedicated class for performing regression with random forests. Its signature is as follows.

class sklearn.ensemble.RandomForestRegressor(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)

Let us walk through the code for training a random forest model to predict the prices of diamonds from their features.

Step 1.1- Importing the necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

The pandas library is used to read the data. The sklearn library is imported to use the RandomForestRegressor algorithm, to split the dataset into training and testing sets, and to compute the performance metrics.

The matplotlib library is used to plot the tree.

Step 1.2- Loading the Data

data = pd.read_csv('diamonds.csv')
data

We store the diamonds dataset in a variable called data. Since the dataset is in CSV format, we use the read_csv function of the pandas library to load it. In the next line, the dataset is printed.

Loading the Data

Step 1.3-Data Preprocessing

Often, when a dataset has been saved to CSV along with its index, an extra column called Unnamed: 0 appears after loading. The data preprocessing step is crucial for removing such irrelevant columns and filling in any missing values in the data.

data.drop("Unnamed: 0", axis=1, inplace=True)
data
data.info()
data.isnull().sum()
data.clarity.value_counts()
df = data
# Assign the result back so the mapping actually takes effect
df["clarity"] = df["clarity"].replace({"SI1":0,"VS2":1,"SI2":2,"VS1":3,"VVS2":4,"VVS1":5,"I1":6,"IF":7})

The data.info() is used to give a general description of the data types present in the dataset.

data.isnull().sum() is used to determine the total number of null values in each column of the dataset. In this case, you can observe that there are no null values in our dataset (refer to the third image below).

Since the RandomForestRegressor only accepts numerical data, string columns cannot be used directly. The clarity column holds string values, but it appeared to be important for predicting the price, so instead of dropping the column we converted the strings to integers (last image).

Here is a snapshot of the results of the data preprocessing steps performed above.

Data Preprocessing

Step 1.4-Label Encoding

This step is similar to the one above. Here, we encode the categorical labels cut, clarity, and color as integer classes.

from sklearn.preprocessing import LabelEncoder
l = LabelEncoder()
df["cut"] = l.fit_transform(df['cut'])
df["clarity"] = l.fit_transform(df['clarity'])
df["color"] = l.fit_transform(df['color'])
m = dict(zip(l.classes_, l.transform(l.classes_)))
m

The variable m holds the numerical mapping produced by the most recent fit_transform call, i.e., the mapping for the color column.

Label Encoding

Step 1.5 – Data Splitting

After the data is cleaned and the labels are encoded, we have to split the data into training and testing sets. Before that, we need to decide on the dependent and independent variables. Looking at the data, we can see that price is the dependent variable: it is the value we want to predict from the other features. What do we do?

We simply take two variables, X and y: the features other than price become the independent variables (X), and price becomes the dependent variable (y).

X = data.drop('price', axis=1)
y = data['price']
X.shape
y.shape
Deciding Dependent and Independent Variables

Let us see how we can split the data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=812)
rf_model = RandomForestRegressor(n_estimators=15, random_state=812)
rf_model.fit(X_train, y_train)
y_pred = rf_model.predict(X_test)

We are dividing the data into two partitions, training and testing, where the test size is 15% of the original data, meaning we train on the remaining 85%. There are four variables (X_train, X_test, y_train, y_test) because we are splitting both the dependent and independent variables.

Here comes the MVP of the example: the RandomForestRegressor. The model is stored in an instance called rf_model. We are using 15 trees in our example, and the random state is fixed so that the output stays the same every time you run the model. The model is fit on the training portions of the independent and dependent variables, and the predict function is then used to generate predictions for the test set, which we evaluate next.

Step 1.6 – Performance Metrics

The performance metrics we mainly use for regression analysis are the mean squared error (MSE), the mean absolute error (MAE), and R-squared.

Mean Absolute Error is the average absolute difference between the predicted and actual values. Mean Squared Error is the average squared difference between the actual and predicted values. The R2 score, also known as the coefficient of determination, is the proportion of the variance in the dependent variable that is predictable from the independent variables.
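For reference, with n test samples, actual values y_i, predicted values ŷ_i, and mean actual value ȳ, the three metrics are calculated as:

MAE = (1/n) * Σ |y_i - ŷ_i|
MSE = (1/n) * Σ (y_i - ŷ_i)²
R² = 1 - Σ (y_i - ŷ_i)² / Σ (y_i - ȳ)²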

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("The mean absolute error is :",mae)
print("The mean square error is :",mse)
print("The R-Squared error is:",r2)
Performance Metrics

We can see that the R2 score is around 0.97, which means the model explains about 97% of the variance in the prices and is performing very well.

We can also plot the individual trees of the forest, but this can take a long time to run and is computationally expensive. The code for plotting one tree is given below.

chosen_tree = 0
plt.figure(figsize=(20, 10))
plot_tree(rf_model.estimators_[chosen_tree], feature_names=data.columns.tolist(), filled=True)
plt.show()
Tree

Sample Future Prediction

In the code given below, we take a sample record from the dataset, follow the same process of separating the dependent and independent variables, and compute the prediction error.

The error tells us how close the predicted price is to the actual one.

new = data.sample(1)
new
X_new = new.drop("price", axis=1)
y_new = new.price
X_new, y_new
y_pred_new = int(rf_model.predict(X_new)[0])
y_new.iloc[0], y_pred_new
error = (y_pred_new-y_new.iloc[0])
error
Future Prediction

Summary

To summarize what we have done so far: we understood what ensemble learning is. To recapitulate, ensemble learning is a technique in machine learning that uses multiple models to solve a task and combines their predictions to produce a stronger classifier.

We then covered the basic techniques used in ensemble learning, such as Voting, Averaging, Bagging, and Boosting, with the help of diagrams. These techniques form the basis for designing any ensemble and determine how the results of the weak learners are combined.

As an example of ensemble learning, we walked through the implementation of Random Forest for regression and covered the essential steps of any model-building workflow: loading the data, data cleaning and preprocessing, splitting the data, and so on.

Dataset

Diamonds dataset

References

Random Forest Regressor

R2 score

Mean Absolute Error

Mean Squared Error