To evaluate a model's performance, we need to understand its two main sources of prediction error: bias and variance. The bias-variance tradeoff is an essential concept in machine learning.
A proper understanding of these errors helps us build a good model while avoiding underfitting and overfitting the data during training.
In this article we will go through these essential concepts.
What is Bias?
Bias is the difference between the average prediction of our model and the correct target value that the model is trying to predict.
A model with high bias oversimplifies the problem, which results in a large difference between the actual and the predicted values.
To understand Bias let’s look at the figure below:
It is clear from the figure above that the model (the line) does not fit the data well. This is known as underfitting. It is an example of high bias, because the difference between the actual values (blue data points) and the predicted values (red line) is large.
High bias typically leads to high error on both training and test data.
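To make underfitting concrete, here is a minimal sketch that fits a straight line to data generated from a curve. The quadratic signal, the noise level, and the helper names (make_data, fit_line, mse) are assumptions chosen for illustration; they are not part of the dataset used later in this article.

```python
import random


def make_data(rng, n):
    """Hypothetical data: a quadratic signal plus Gaussian noise (sd = 0.5)."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mse(xs, ys, intercept, slope):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
x_train, y_train = make_data(rng, 200)
x_test, y_test = make_data(rng, 200)

# A straight line is too simple for a quadratic signal.
intercept, slope = fit_line(x_train, y_train)
train_mse = mse(x_train, y_train, intercept, slope)
test_mse = mse(x_test, y_test, intercept, slope)
noise_variance = 0.5 ** 2  # the lowest MSE any model could reach on this data

print(f"train MSE: {train_mse:.2f}")
print(f"test MSE:  {test_mse:.2f}")
print(f"noise variance: {noise_variance:.2f}")
```

Both errors should come out far above the noise variance. The model is poor on the data it was trained on as well as on new data, which is the signature of high bias.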
What is Variance?
Variance is the variability of the model's prediction for a given data point. It tells us how much the predictions spread when the model is trained on different samples of the data. So what does high variance look like?
Models with high variance fit the training data in a very complex way, which basically means that the model has memorized the training data. Because of this, the model is not able to give correct predictions on previously unseen data.
Such models perform very well on training data but have high error rates on test data.
This is known as overfitting.
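A model that memorizes its training data can be sketched with a 1-nearest-neighbour regressor, which simply returns the target of the closest training point. The sine signal, the noise level, and the helper names below are assumptions made for this illustration only.

```python
import math
import random


def make_data(rng, n):
    """Hypothetical data: a sine signal plus Gaussian noise (sd = 0.5)."""
    xs = [rng.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0, 0.5) for x in xs]
    return xs, ys


def predict_1nn(x_train, y_train, x):
    """Return the target of the single training point closest to x."""
    nearest = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return y_train[nearest]


def mse(x_train, y_train, xs, ys):
    """Mean squared error of the 1-NN model on the points (xs, ys)."""
    errors = [(y - predict_1nn(x_train, y_train, x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


rng = random.Random(0)
x_train, y_train = make_data(rng, 100)
x_test, y_test = make_data(rng, 100)

# On the training set every point is its own nearest neighbour.
train_mse = mse(x_train, y_train, x_train, y_train)
test_mse = mse(x_train, y_train, x_test, y_test)

print(f"train MSE: {train_mse:.3f}")
print(f"test MSE:  {test_mse:.3f}")
```

The training error is zero because the model reproduces the noise in every training target, while the test error stays well above zero. That gap between training and test error is what overfitting looks like in practice.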
What is the total error?
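Bias and variance are given by:
- Bias[f'(X)] = E[f'(X) - f(X)]
- Variance[f'(X)] = E[f'(X)²] - E[f'(X)]²
where f(X) is the true value and f'(X) is our model's prediction, which we want to be close to f(X). The expectation E is taken over the different training sets the model could have been trained on.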
The important point to notice here is that the expected total error of a model (under squared-error loss) is made up of three elements:
Total Error = Bias² + Variance + Irreducible Error
Here, the irreducible error is the error that cannot be reduced by any model. It is the inherent noise in our data. We can, however, control the amount of bias and variance a model has.
Hence we try to obtain the best values for bias and variance by varying the model complexity. We look for a balance between bias and variance such that the total error is at its minimum.
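The decomposition can be checked numerically with a small simulation. This is only a sketch under stated assumptions: the true function (a sine), the noise level, the query point X0, and the choice of a straight-line model are all hypothetical. The idea is to train the same model on many independent training sets, collect its predictions at one point, and compare the simulated squared error with Bias² + Variance + Irreducible Error.

```python
import math
import random
import statistics

NOISE_SD = 0.5     # assumed noise level; its square is the irreducible error
X0 = 1.0           # assumed query point at which the model is evaluated
N_TRAIN = 20       # size of each simulated training set
N_REPEATS = 2000   # number of independent training sets


def true_f(x):
    """Hypothetical true function f(X)."""
    return math.sin(x)


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


rng = random.Random(42)
predictions = []
squared_errors = []
for _ in range(N_REPEATS):
    # Draw a fresh training set and fit the model to it.
    xs = [rng.uniform(0, 3) for _ in range(N_TRAIN)]
    ys = [true_f(x) + rng.gauss(0, NOISE_SD) for x in xs]
    intercept, slope = fit_line(xs, ys)
    pred = intercept + slope * X0
    predictions.append(pred)
    # Compare the prediction with a fresh noisy observation at X0.
    y0 = true_f(X0) + rng.gauss(0, NOISE_SD)
    squared_errors.append((y0 - pred) ** 2)

bias = statistics.fmean(predictions) - true_f(X0)
variance = statistics.pvariance(predictions)
irreducible = NOISE_SD ** 2

total_from_formula = bias ** 2 + variance + irreducible
total_simulated = statistics.fmean(squared_errors)

print(f"bias^2:             {bias ** 2:.4f}")
print(f"variance:           {variance:.4f}")
print(f"irreducible error:  {irreducible:.4f}")
print(f"sum of the three:   {total_from_formula:.4f}")
print(f"simulated error:    {total_simulated:.4f}")
```

The last two numbers should agree up to sampling noise. Note that the irreducible term stays the same no matter which model is used; only the bias and variance terms change with the model.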
Now what is Bias Variance Tradeoff?
If we have a very simple model, we have high bias and low variance, as we have seen in the previous section. Similarly, if we get a complex fit on our training data, we say that the model has high variance and low bias. Either way, we won't get good results.
So the bias-variance tradeoff implies that there must be an appropriate balance between model bias and variance, so that the total error is minimized without overfitting or underfitting the data.
With a good balance between bias and variance, the model neither overfits nor underfits the data.
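The tradeoff can be sketched by sweeping the complexity of a single model family. The example below uses k-nearest-neighbour regression on hypothetical sine data; the data, the test grid, and the helper names are assumptions for illustration. In k-NN, a small k gives a very flexible model and a large k gives a very simple one (k equal to the training set size just predicts the mean).

```python
import math
import random
import statistics


def knn_predict(x_train, y_train, x, k):
    """Average the targets of the k training points closest to x."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
    return sum(y_train[i] for i in order[:k]) / k


def bias_variance(k, rng, n_train=40, n_repeats=300, noise_sd=0.5):
    """Estimate the average bias^2 and variance of k-NN over a fixed test grid."""
    grid = [0.5 * i for i in range(1, 12)]  # 0.5, 1.0, ..., 5.5
    preds = [[] for _ in grid]
    for _ in range(n_repeats):
        # Each repeat simulates training on a different sample of the data.
        xs = [rng.uniform(0, 6) for _ in range(n_train)]
        ys = [math.sin(x) + rng.gauss(0, noise_sd) for x in xs]
        for j, x0 in enumerate(grid):
            preds[j].append(knn_predict(xs, ys, x0, k))
    bias_sq = statistics.fmean(
        [(statistics.fmean(p) - math.sin(x0)) ** 2 for p, x0 in zip(preds, grid)]
    )
    variance = statistics.fmean([statistics.pvariance(p) for p in preds])
    return bias_sq, variance


results = {}
for k in (1, 5, 40):
    results[k] = bias_variance(k, random.Random(0))
    bias_sq, variance = results[k]
    print(f"k={k:>2}  bias^2={bias_sq:.3f}  variance={variance:.3f}  "
          f"sum={bias_sq + variance:.3f}")
```

With these settings, k=1 should show low bias and high variance, k=40 the reverse, and the intermediate k=5 the smallest sum of the two. Choosing the complexity that minimizes that sum is what balancing bias and variance means.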
Example of Bias Variance Tradeoff in Python
Let's see how we can calculate the bias and variance of a model. Run this line at the command prompt to install the required package:
pip install mlxtend
You can download the dataset used in this example here (filename: score.csv).
Let's see how we can determine the bias and variance of a model using the mlxtend library.
# Import the required modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.evaluate import bias_variance_decomp
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Read the dataset
df = pd.read_csv('score.csv')
x = np.array(df.Hours).reshape(-1, 1)  # features must be 2-D: (n_samples, 1)
y = np.array(df.Scores)                # targets stay 1-D: (n_samples,)

# Split the dataset into train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.4, random_state=0)

# Build a deliberately shallow tree
regressor = DecisionTreeRegressor(max_depth=1)

# Estimate bias and variance; this refits the model on bootstrap samples
avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
    regressor, x_train, y_train, x_test, y_test,
    loss='mse', random_seed=1)

# Fit on the full training set so the plot shows the final model
regressor.fit(x_train, y_train)

# Plot the fitted model against the training data
x_grid = np.linspace(x_train.min(), x_train.max(), 100).reshape(-1, 1)
plt.plot(x_grid, regressor.predict(x_grid))
plt.scatter(x_train, y_train, color='red')
plt.xlabel('Hours')
plt.ylabel('Score')
plt.title('Model with a High Bias')
plt.show()

print('average Bias: ', avg_bias)
print('average Variance: ', avg_var)

The original run of this example, which passed y as a column vector of shape (n, 1), printed:
average Bias: 10455.986051700678
average Variance: 61.150793197489904
Note that bias_variance_decomp expects 1-D target arrays. With a column vector, NumPy broadcasting inside the function appears to inflate the bias figure, so the code above (which keeps y 1-D) should report a smaller bias. The variance figure does not depend on the shape of y_test.
The above plot shows that our model didn't learn the data well and hence has high bias, because we set the maximum depth of the tree to 1. Such a model will yield poor results when evaluated on a test set.
You can try playing with the code on a different dataset and using a different model and changing the parameters to get a model that has low bias and low variance.
Bias and variance play an important role in deciding which predictive model to use. In this article, we learned about the bias-variance tradeoff and what underfitting and overfitting look like. Finally, we learned that a good model is one that has both low bias error and low variance error.