Welcome to this article on Random Forest Regression. Let me quickly walk you through the meaning of regression first.
What is Regression in Machine Learning?
Regression is a machine learning technique used to predict continuous numeric values. Let us understand this concept with an example: consider the salaries of employees and their experience in years.
A regression model trained on this data can predict the salary for a given number of years of experience, even if that number has no corresponding salary in the dataset.
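To make the idea concrete, here is a minimal sketch using scikit-learn's LinearRegression, which is only a simple stand-in for "a regression model" and not the technique this article is about. The experience and salary values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data (made-up values): years of experience and salary.
experience = np.array([[1], [2], [3], [5], [6], [8]])
salary = np.array([40000, 45000, 50000, 60000, 65000, 75000])

model = LinearRegression()
model.fit(experience, salary)

# 4 years of experience is absent from the data,
# yet the model can still estimate a salary for it.
predicted = model.predict([[4]])
print(predicted)
```

The point is only that the fitted model returns a value for an input it has never seen.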
What is Random Forest Regression?
Random forest regression is an ensemble learning technique. But what is ensemble learning?
In ensemble learning, you take multiple algorithms, or the same algorithm multiple times, and put together a model that’s more powerful than any of the originals.
A prediction based on many trees is usually more accurate than one from a single tree because it averages many individual predictions. The algorithm is also more stable: a change in the dataset may strongly affect one tree, but it has much less effect on the forest as a whole.
Steps to perform the random forest regression
This is a four-step process:
- Pick K random data points from the training set.
- Build the decision tree associated with these K data points.
- Choose the number Ntree of trees you want to build and repeat the two steps above for each tree.
- For a new data point, make each of your Ntree trees predict the value of Y for that data point, and assign the new data point the average of all the predicted Y values.
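The four steps above can be sketched by hand with scikit-learn's DecisionTreeRegressor. This is an illustrative sketch, not how the library implements its forest internally. The training values are made up, and two choices are assumptions of this example: K is set to the size of the training set, and the K points are drawn with replacement.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical training data (made-up values): one feature, one target.
X_train = np.arange(1, 11).reshape(-1, 1).astype(float)
y_train = np.array([30, 35, 42, 55, 70, 95, 130, 180, 260, 400], dtype=float)

n_tree = 25       # number of trees to build (step 3)
k = len(X_train)  # points drawn per tree (assumption: K = training set size)

trees = []
for _ in range(n_tree):
    # Step 1: pick K random data points (with replacement, an assumption here).
    idx = rng.integers(0, len(X_train), size=k)
    # Step 2: build a decision tree on those K points.
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: every tree predicts Y for the new point; the forest returns the average.
new_point = np.array([[6.5]])
tree_predictions = np.array([t.predict(new_point)[0] for t in trees])
forest_prediction = tree_predictions.mean()
print(forest_prediction)
```

Because each tree sees a slightly different sample, the individual predictions differ, and the average smooths those differences out.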
Implementing Random Forest Regression in Python
Our goal here is to build a team of decision trees, each making a prediction about the dependent variable; the final prediction of the random forest is the average of the predictions of all the trees.
For our example, we will be using the Salary – positions dataset (Position_Salaries.csv) and will predict the salary based on the position level.
The dataset used can be found at https://github.com/content-anu/dataset-polynomial-regression
1. Importing the dataset
We’ll use the numpy, pandas, and matplotlib libraries to implement our model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv('Position_Salaries.csv')
dataset.head()
The dataset snapshot is as follows:

2. Data preprocessing
There is not much data preprocessing to do. We just have to identify the matrix of features (X) and the dependent variable vector (y).
X = dataset.iloc[:,1:2].values
y = dataset.iloc[:,2].values
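The 1:2 slice matters because scikit-learn expects the features as a 2-D array. The sketch below shows the resulting shapes on a tiny stand-in DataFrame. Its column names and values are assumptions made for illustration, not the contents of the real CSV.

```python
import pandas as pd

# Hypothetical stand-in for the CSV with a three-column layout
# (position name, level, salary); names and values are made up.
dataset = pd.DataFrame({
    'Position': ['Analyst', 'Manager', 'Director'],
    'Level': [1, 2, 3],
    'Salary': [40000, 60000, 90000],
})

X = dataset.iloc[:, 1:2].values  # slice 1:2 keeps a 2-D array: (n_samples, 1)
y = dataset.iloc[:, 2].values    # a single index gives a 1-D array: (n_samples,)

print(X.shape, y.shape)

# Indexing with a plain 1 instead of 1:2 would drop the column dimension.
flat = dataset.iloc[:, 1].values
print(flat.shape)
```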
3. Fitting the Random forest regression to dataset
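We will import RandomForestRegressor from the ensemble module of sklearn and create a regressor object using its class constructor. The parameters include:
- n_estimators : number of trees in the forest. (default = 10 in older scikit-learn releases, 100 since version 0.22)
- criterion : function used to measure the quality of a split. The default is mean squared error, named mse in older releases and squared_error since version 1.0. This was also a part of decision tree.
- random_state : seed for the random number generator, which makes the results reproducible.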
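Since default values can differ between scikit-learn releases, it can help to check what your installed version actually uses. A small sketch using the estimator's get_params() method:

```python
from sklearn.ensemble import RandomForestRegressor

# Inspect the defaults of the installed scikit-learn version.
params = RandomForestRegressor().get_params()

print('n_estimators:', params['n_estimators'])
print('criterion:', params['criterion'])
print('random_state:', params['random_state'])
```

Passing n_estimators and random_state explicitly, as done in this article, keeps the results independent of those defaults.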
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 10, random_state = 0)
regressor.fit(X,y)
The regressor line is as follows:

We will just make a test prediction as follows:
y_pred=regressor.predict([[6.5]])
y_pred
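To check that the forest's prediction really is the average of its trees' predictions, you can look at the individual trees through the fitted model's estimators_ attribute. The sketch below uses made-up stand-in data instead of the CSV, so its numbers will not match the ones in this article.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data (made-up values): levels 1..10 and salaries.
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([30, 35, 42, 55, 70, 95, 130, 180, 260, 400]) * 1000.0

regressor = RandomForestRegressor(n_estimators=10, random_state=0)
regressor.fit(X, y)

forest_pred = regressor.predict([[6.5]])[0]

# estimators_ holds the fitted trees; average their individual predictions.
tree_preds = np.array([tree.predict([[6.5]])[0] for tree in regressor.estimators_])

print(forest_pred, tree_preds.mean())
```

The two printed values should agree up to floating-point rounding.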

4. Visualizing the result
#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(len(X_grid),1)
plt.scatter(X,y, color='red') #plotting real points
plt.plot(X_grid, regressor.predict(X_grid),color='blue') #plotting for predict points
plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
The graph produced is as shown below:

5. Interpretation of the above graph
We get many more steps in this graph than with a single decision tree. There are a lot more intervals and splits, so we get more steps in our stairs.
Every prediction is the average of 10 individual predictions (we have taken 10 decision trees). The random forest calculates such an average for each of these intervals.
The more trees we include, the more stable the prediction tends to be, because the average over many trees converges towards the same final value.
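One way to probe how the number of trees affects the prediction is to refit the model with different random seeds and see how much the prediction at 6.5 moves. The sketch below does this on made-up stand-in data. The helper prediction_spread and the choice of 20 seeds are assumptions of this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data (made-up values): levels 1..10 and salaries.
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([30, 35, 42, 55, 70, 95, 130, 180, 260, 400]) * 1000.0

def prediction_spread(n_trees, seeds=range(20)):
    """Standard deviation of the prediction at 6.5 across random seeds."""
    preds = []
    for seed in seeds:
        model = RandomForestRegressor(n_estimators=n_trees, random_state=seed)
        model.fit(X, y)
        preds.append(model.predict([[6.5]])[0])
    return np.std(preds)

spread_10 = prediction_spread(10)
spread_100 = prediction_spread(100)
print(spread_10, spread_100)
```

A smaller spread means the prediction depends less on the particular random draw, which is what "more stable" means here. It does not by itself prove the prediction is closer to the true value.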
6. Rebuilding the model for 100 trees
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 100, random_state = 0)
regressor.fit(X,y)
The regressor object created for the above 100 trees is as follows:

7. Creating the graph for 100 trees
#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(len(X_grid),1)
plt.scatter(X,y, color='red')
plt.plot(X_grid, regressor.predict(X_grid),color='blue')
plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

The number of steps in the graph doesn’t increase tenfold just because the number of trees did. But the prediction is likely to be more stable. Let’s predict the result for the same input.
y_pred=regressor.predict([[6.5]])
y_pred
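The remark about the number of steps can be checked roughly by counting the distinct predicted values along the plotting grid, since each flat step of the curve has one value. The sketch below uses made-up stand-in data, and count_steps is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data (made-up values): levels 1..10 and salaries.
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([30, 35, 42, 55, 70, 95, 130, 180, 260, 400]) * 1000.0

# Same high resolution grid as used for the plots.
X_grid = np.arange(X.min(), X.max(), 0.01).reshape(-1, 1)

def count_steps(n_trees):
    model = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    model.fit(X, y)
    # Each distinct predicted value on the grid corresponds to one flat step.
    return len(np.unique(model.predict(X_grid)))

steps_10 = count_steps(10)
steps_100 = count_steps(100)
print(steps_10, steps_100)
```

With a single feature and only ten training points, there are only so many places where a tree can split, so the step count cannot keep growing with the number of trees.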

8. Rebuilding the model for 300 trees
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 300, random_state = 0)
regressor.fit(X,y)
The above code snippet produces the following regressor:

9. Graph for 300 trees
#higher resolution graph
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(len(X_grid),1)
plt.scatter(X,y, color='red') #plotting real points
plt.plot(X_grid, regressor.predict(X_grid),color='blue') #plotting for predict points
plt.title("Truth or Bluff(Random Forest - Smooth)")
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
The above code produces the following graph:

Now, let us make a prediction.
y_pred=regressor.predict([[6.5]])
y_pred
The output for the above code is as follows:

Complete Python Code for Implementing Random Forest Regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

dataset = pd.read_csv('Position_Salaries.csv')
print(dataset.head())

X = dataset.iloc[:, 1:2].values  # matrix of features (position level)
y = dataset.iloc[:, 2].values    # dependent variable vector (salary)

# higher resolution grid for the graphs
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape(len(X_grid), 1)

# fit, predict and plot for 10, 100 and 300 trees
for n_trees in (10, 100, 300):
    regressor = RandomForestRegressor(n_estimators=n_trees, random_state=0)
    regressor.fit(X, y)

    y_pred = regressor.predict([[6.5]])
    print(n_trees, 'trees:', y_pred)

    plt.scatter(X, y, color='red')  # plotting real points
    plt.plot(X_grid, regressor.predict(X_grid), color='blue')  # plotting predicted points
    plt.title("Truth or Bluff(Random Forest - Smooth)")
    plt.xlabel('Position level')
    plt.ylabel('Salary')
    plt.show()
The output of the above code will be graphs and prediction values. Below are the graphs:

Conclusion
As you have observed, the 10-tree model predicted the salary for position level 6.5 to be 167,000. The 100-tree model predicted 158,300 and the 300-tree model predicted 160,333.33. In general, the more trees we use, the more stable the averaged prediction becomes, although the improvement levels off as more trees are added.