Random Forest Regression: A Complete Reference

Welcome to this article on Random Forest Regression. Let me quickly walk you through the meaning of regression first.

What is Regression in Machine Learning?

Regression is a machine learning technique used to predict continuous numeric values. Let us understand this concept with an example: consider the salaries of employees and their experience in years.

A regression model trained on this data can predict the salary of an employee even for a number of years of experience that has no corresponding salary in the dataset.
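As a minimal sketch of this idea, the snippet below fits a straight line to a handful of made-up (experience, salary) pairs and estimates the salary for an experience value that is absent from the data. The numbers, the variable names, and the choice of LinearRegression are illustrative assumptions; the points were deliberately placed on a straight line to keep the example easy to follow.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience and salary (made-up values
# that lie exactly on the line salary = 35000 + 5000 * years).
years = np.array([[1], [2], [3], [5], [6], [8]])
salary = np.array([40000, 45000, 50000, 60000, 65000, 75000])

model = LinearRegression()
model.fit(years, salary)

# 4 years does not appear in the data, but the model can still estimate it.
predicted = model.predict([[4]])[0]
print(round(predicted))
```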

What is Random Forest Regression?

Random forest regression is an ensemble learning technique. But what is ensemble learning?

In ensemble learning, you take multiple algorithms, or the same algorithm multiple times, and put together a model that is more powerful than any of its individual parts.

A prediction based on many trees is usually more accurate than one from a single tree because it averages many individual predictions. The ensemble is also more stable: a change in the dataset may affect one tree, but it is unlikely to change the whole forest of trees by much.
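The stabilizing effect of averaging can be illustrated without any trees at all. The sketch below simulates each "model" as the true value plus independent random noise and compares the spread of a single model with the spread of an average of 25 models. The independence of the errors is a simplifying assumption: the trees in a real forest are correlated, so the gain is smaller in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 100.0

# Each simulated "model" returns the true value plus independent noise.
single = true_value + rng.normal(0, 10, size=5000)

# Average 25 such models for each of 5000 simulated predictions.
averaged = true_value + rng.normal(0, 10, size=(5000, 25)).mean(axis=1)

print("spread of one model:      ", single.std())
print("spread of 25-model average:", averaged.std())
```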

Steps to perform the random forest regression

This is a four-step process:

1. Pick K data points at random from the training set.
2. Build the decision tree associated with these K data points.
3. Choose the number Ntree of trees you want to build and repeat steps 1 and 2.
4. For a new data point, make each one of your Ntree trees predict the value of Y for the data point in question, and assign the new data point the average of all the predicted Y values.
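The four steps can be sketched by hand with scikit-learn's DecisionTreeRegressor. This is an illustration under stated assumptions rather than how RandomForestRegressor is implemented internally: the helper names fit_forest and predict_forest are hypothetical, the K points are drawn with replacement (a bootstrap sample), and the data is a made-up stand-in. A full random forest may also consider a random subset of features at each split, which makes no difference with a single feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_forest(X, y, n_trees=10, k=None, seed=0):
    """Steps 1 to 3: fit n_trees trees, each on K randomly drawn points."""
    rng = np.random.default_rng(seed)
    k = len(X) if k is None else k
    trees = []
    for _ in range(n_trees):
        # Step 1: pick K data points at random (with replacement, an assumption).
        idx = rng.integers(0, len(X), size=k)
        # Step 2: build a decision tree on those points.
        tree = DecisionTreeRegressor(random_state=0)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X_new):
    """Step 4: average the predictions of all trees."""
    return np.mean([tree.predict(X_new) for tree in trees], axis=0)

# Made-up stand-in data: levels 1 to 10 and steeply rising salaries.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 1000.0 * np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000])

trees = fit_forest(X, y, n_trees=10)
pred = predict_forest(trees, np.array([[6.5]]))
print(pred)
```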

Implementing Random Forest Regression in Python

Our goal here is to build a team of decision trees, each making a prediction about the dependent variable; the final prediction of the random forest is the average of the predictions of all the trees.

For our example, we will use the position salaries dataset (Position_Salaries.csv) and predict the salary based on the position level.

The dataset used can be found at https://github.com/content-anu/dataset-polynomial-regression

1. Importing the dataset

We’ll use the numpy, pandas, and matplotlib libraries to implement our model.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('Position_Salaries.csv')
dataset.head()

The dataset snapshot is as follows:

2. Data preprocessing

Little data preprocessing is needed. We only have to identify the matrix of features (X) and the vector of the dependent variable (y).

X = dataset.iloc[:,1:2].values
y = dataset.iloc[:,2].values
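The slice 1:2 is used instead of the plain index 1 so that X stays two-dimensional, which is the shape scikit-learn estimators expect for the features. The sketch below shows the difference on a small stand-in frame; its column names and values are assumptions chosen to mimic a three-column layout, not the contents of Position_Salaries.csv.

```python
import pandas as pd

# Stand-in frame with an assumed three-column layout (made-up values).
dataset = pd.DataFrame({
    "Position": ["Analyst", "Consultant", "Manager"],
    "Level": [1, 2, 3],
    "Salary": [45000, 50000, 60000],
})

X = dataset.iloc[:, 1:2].values    # slice 1:2 keeps the column dimension
y = dataset.iloc[:, 2].values      # a single index gives a flat array
X_flat = dataset.iloc[:, 1].values # same column, but one-dimensional

print(X.shape)       # (3, 1)
print(y.shape)       # (3,)
print(X_flat.shape)  # (3,)
```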

3. Fitting the Random forest regression to dataset

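We will import the RandomForestRegressor from the ensemble module of sklearn. We create a regressor object using the RandomForestRegressor class constructor. The parameters include:

n_estimators : the number of trees in the forest. The default was 10 in older scikit-learn releases and is 100 from version 0.22 onward.
criterion : the function used to measure the quality of a split. The default is mean squared error, named mse in older releases and squared_error from version 1.0 onward. This was also a part of the decision tree.
random_state : the seed that controls the randomness of the sampling, so that results are reproducible.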
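To make the role of random_state concrete, the sketch below fits the same forest twice with one seed and once with another, and also prints the n_estimators default of whichever scikit-learn version is installed. The helper predict_at and the data are hypothetical stand-ins used only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up stand-in data: levels 1 to 10 and steeply rising salaries.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 1000.0 * np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000])

def predict_at(level, seed):
    """Fit a 10-tree forest with the given seed and predict one value."""
    model = RandomForestRegressor(n_estimators=10, random_state=seed)
    model.fit(X, y)
    return model.predict([[level]])[0]

same_a = predict_at(6.5, seed=0)
same_b = predict_at(6.5, seed=0)
other = predict_at(6.5, seed=1)

print(same_a == same_b)  # the same seed rebuilds the same forest
print(same_a, other)     # a different seed may give a different value

# Default number of trees in the installed scikit-learn version.
default_trees = RandomForestRegressor().get_params()["n_estimators"]
print(default_trees)
```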

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X,y)

The regressor line is as follows:

We will just make a test prediction as follows:

y_pred=regressor.predict([[6.5]])
y_pred

Output of the prediction
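To see where such a number comes from, the individual trees of a fitted forest can be queried through its estimators_ attribute. The sketch below checks that the forest's prediction is the mean of the per-tree predictions. It uses made-up stand-in data so that it runs on its own; the same check should apply to the regressor fitted on the article's X and y.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up stand-in data: levels 1 to 10 and steeply rising salaries.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 1000.0 * np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000])

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

query = np.array([[6.5]])

# Ask every tree in the forest for its own prediction.
per_tree = np.array([tree.predict(query)[0] for tree in regressor.estimators_])
forest_pred = regressor.predict(query)[0]

print(per_tree)         # one prediction per tree
print(per_tree.mean())  # their average
print(forest_pred)      # the forest's prediction
```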

4. Visualizing the result

#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(-1, 1)

plt.scatter(X,y, color='red') #plotting real points
plt.plot(X_grid, regressor.predict(X_grid),color='blue') #plotting for predict points

plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

The graph produced is as shown below:

5. Interpretation of the above graph

We get many more steps in this graph than with one decision tree. We have many more intervals and splits, so there are more steps in our stairs.
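One rough way to quantify the "steps" is to count the distinct values the model predicts over the fine grid, since every flat step corresponds to one predicted value. The sketch below compares a single decision tree with a 10-tree forest; the data is a made-up stand-in and the exact counts depend on the data and the seed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Made-up stand-in data: levels 1 to 10 and steeply rising salaries.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 1000.0 * np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000])

# Same fine grid as used for the plot.
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Each distinct predicted value corresponds to one flat step in the plot.
tree_steps = np.unique(tree.predict(X_grid)).size
forest_steps = np.unique(forest.predict(X_grid)).size

print("steps with one tree:", tree_steps)
print("steps with a forest:", forest_steps)
```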

Every prediction is the average of 10 individual tree predictions (we have taken 10 decision trees). The random forest calculates such an average for each of these intervals.

The more trees we include, the more stable and generally more accurate the prediction becomes, because the average over many trees converges toward the same value. The improvement levels off as the forest grows.
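This convergence can be checked empirically by refitting the forest with several different seeds and measuring how much the prediction for one input varies. The sketch below does so for 10 and 200 trees; the helper spread_over_seeds, the choice of 10 seeds, and the data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up stand-in data: levels 1 to 10 and steeply rising salaries.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 1000.0 * np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000])

def spread_over_seeds(n_trees, seeds=range(10)):
    """Standard deviation of the prediction at level 6.5 across seeds."""
    preds = []
    for seed in seeds:
        model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        model.fit(X, y)
        preds.append(model.predict([[6.5]])[0])
    return np.std(preds)

spread_10 = spread_over_seeds(10)
spread_200 = spread_over_seeds(200)

print("spread with 10 trees: ", spread_10)
print("spread with 200 trees:", spread_200)
```

A smaller spread means the prediction depends less on the particular random draw, which is the sense in which larger forests are more stable.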

6. Rebuilding the model for 100 trees

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
regressor.fit(X,y)

The fitted regressor for the above 100 trees is displayed as follows:

7. Creating the graph for 100 trees

#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(-1, 1)
plt.scatter(X,y, color='red') 

plt.plot(X_grid, regressor.predict(X_grid),color='blue') 
plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

The number of steps in the graph does not grow tenfold along with the number of trees in the forest, but the prediction is generally better. Let’s predict the result for the same value.

y_pred=regressor.predict([[6.5]])
y_pred

Output prediction

8. Rebuilding the model for 300 trees

from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 300, random_state = 0)
regressor.fit(X,y)

The output for the above code snippet produces the following regressor:

9. Graph for 300 trees

#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(-1, 1)

plt.scatter(X,y, color='red') #plotting real points
plt.plot(X_grid, regressor.predict(X_grid),color='blue') #plotting for predict points

plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

The above code produces the following graph:

Now, let us make a prediction.

y_pred=regressor.predict([[6.5]])
y_pred

The output for the above code is as follows:

Prediction using 300 trees

Complete Python Code for Implementing Random Forest Regression

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
 
dataset = pd.read_csv('Position_Salaries.csv')
dataset.head()

X = dataset.iloc[:,1:2].values
y = dataset.iloc[:,2].values

# for 10 trees
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X,y)

y_pred=regressor.predict([[6.5]])
y_pred

#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(-1, 1)
 
plt.scatter(X,y, color='red') #plotting real points
plt.plot(X_grid, regressor.predict(X_grid),color='blue') #plotting for predict points
 
plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()


# for 100 trees
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
regressor.fit(X,y)

#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(-1, 1)
plt.scatter(X,y, color='red') 
 
plt.plot(X_grid, regressor.predict(X_grid),color='blue') 
plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

y_pred=regressor.predict([[6.5]])
y_pred

# for 300 trees
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 300, random_state = 0)
regressor.fit(X,y)

#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(-1, 1)
 
plt.scatter(X,y, color='red') #plotting real points
plt.plot(X_grid, regressor.predict(X_grid),color='blue') #plotting for predict points
 
plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

y_pred=regressor.predict([[6.5]])
y_pred

The output of the above code will be graphs and prediction values. Below are the graphs:

Conclusion

As you have observed, the 10-tree model predicted a salary of 167,000 for position level 6.5. The 100-tree model predicted 158,300 and the 300-tree model predicted 160,333.33. The predictions change less and less as trees are added, so a larger number of trees generally gives a more stable, and typically more accurate, result.