What Is Bias And Variance In Python3?

Bias and variance are two distinct sources of error in Machine Learning and Deep Learning models. The primary objective when building any machine learning model is accurate prediction on new data. Balancing these two sources of error, commonly known as the bias-variance tradeoff, improves prediction accuracy. This article defines bias and variance and shows how they behave in different models, using Python.

What Exactly is Bias?

In machine learning, bias is a systematic error in a model’s predictions: the predictions deviate from the actual values (the ground truth) in a consistent way, which can lead to inaccurate or unfair results. This bias may stem from various sources encountered during the model’s training process, such as overly simple modeling assumptions or unrepresentative training data. Two situations are commonly discussed in this context: underfitting, which corresponds to high bias, and overfitting, which corresponds to low bias but high variance.

Underfitting (High Bias) Error in Machine Learning Models

High bias is associated with underfitting. This condition occurs when the model is too simple for the problem: it misses underlying patterns in the training data and fails to capture the relationship between the input features and the target outputs. An underfit model performs poorly on both the training data and the testing/validation data.

There are a few ways to manage underfitting in machine learning models, such as increasing the complexity of the model, for example by increasing the number of parameters. Making the training data more representative of the real-world problem also helps.
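As a rough illustration of underfitting, the sketch below fits a straight line and a quadratic to synthetic data that follows a quadratic curve. The data, the noise level, and the helper name `train_test_mse` are assumptions made for this example, not part of the article's own code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y depends on x quadratically, plus noise.
x = rng.uniform(-3, 3, size=200)
y = x**2 + rng.normal(scale=0.5, size=x.shape)

# Use the first 140 points for training and the remaining 60 for testing.
x_train, x_test = x[:140], x[140:]
y_train, y_test = y[:140], y[140:]

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

linear_train, linear_test = train_test_mse(1)  # too simple for a curved relationship
quad_train, quad_test = train_test_mse(2)      # same form as the data-generating curve

print(f"degree 1: train MSE={linear_train:.2f}, test MSE={linear_test:.2f}")
print(f"degree 2: train MSE={quad_train:.2f}, test MSE={quad_test:.2f}")
```

The straight line has a large error on both splits, which is the signature of high bias. Adding the quadratic term (more model complexity) removes most of that error.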

Overfitting (Low Bias) Error in Machine Learning Models

This type of error is the opposite of underfitting: the model is too complex for the problem. Bias on the training data is low, but the model picks up noise and random fluctuations in that data, which hurts its accuracy on new data. To reduce overfitting, use a simpler model or apply regularization techniques. Training on a larger and more diverse set of examples also helps.
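The effect can be sketched with NumPy by fitting a high-degree and a low-degree polynomial to a small, noisy training set. Everything here (the sine-shaped target, the noise level, the degrees 10 and 3) is an assumption chosen for illustration. It demonstrates only the "simpler model" remedy, not regularization.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    """Assumed underlying relationship for this example."""
    return np.sin(3 * x)

# A small noisy training set and a larger held-out test set.
x_train = rng.uniform(-1, 1, size=14)
y_train = true_fn(x_train) + rng.normal(scale=0.3, size=x_train.shape)
x_test = np.linspace(-1, 1, 200)
y_test = true_fn(x_test) + rng.normal(scale=0.3, size=x_test.shape)

def train_test_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_err, test_err

complex_train, complex_test = train_test_mse(10)  # flexible enough to chase the noise
simple_train, simple_test = train_test_mse(3)     # simpler model

print(f"degree 10: train MSE={complex_train:.3f}, test MSE={complex_test:.3f}")
print(f"degree 3:  train MSE={simple_train:.3f}, test MSE={simple_test:.3f}")
```

The flexible model always fits the training points at least as closely as the simpler one. The thing to watch is the gap between its training error and its test error.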

Mathematical Formula of Bias

Predicted Output (Y) = b + w1*x1 + w2*x2 + w3*x3 + ... + wn*xn

Here, b is the bias term and wi is the weight associated with feature xi.

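This is the formula for linear regression. The bias term b is the intercept: it shifts every prediction by a constant, so it plays an important role in the accuracy of the model. Note that this learned parameter shares its name with the systematic prediction error described above, but it is a different thing. The same form can be adapted to other models.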
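As a small worked example, the linear regression formula can be evaluated with NumPy. The weights, bias term, and inputs below are made-up values used only to show the arithmetic.

```python
import numpy as np

# Hypothetical parameters for a model with three features.
w = np.array([0.5, -1.2, 2.0])  # weights w1, w2, w3
b = 0.7                         # bias term (intercept)

# Two example inputs, one row per sample.
X = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.5]])

# Y = b + w1*x1 + w2*x2 + w3*x3, computed for every row at once.
y_pred = b + X @ w
print(y_pred)
```

For the first row this works out to 0.7 + 0.5 - 2.4 + 6.0 = 4.8, and for the second row to 0.7 + 0.0 - 1.2 + 1.0 = 0.5.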

Concept of Variance in the Machine Learning Domain

In machine learning, variance describes how much a model’s predictions change in response to changes in the training data. It quantifies the fluctuation in the model’s output when different training datasets are used. High variance signifies that the model is excessively responsive to specific instances within the training data, which can result in a poor ability to generalize and predict outcomes for new, unseen data. Let’s discuss high and low variance in turn.

High Variance Error in Machine Learning Models

With this type of error, the model is too complex relative to the problem. It performs well on the training data, with high accuracy, but when it is evaluated on unseen data it produces poor results with lower accuracy. This gap is the hallmark of high variance: good accuracy on the training set and low accuracy on new datasets.

In the implementation of supervised learning, a model with high variance tightly fits the training data but struggles to make accurate predictions on unseen examples. This lack of generalization leads to poor performance on the test data. In the implementation of complex neural networks, one must be cautious. While a network with multiple layers and parameters may flawlessly fit the training data, it could stumble when faced with new data, lacking in generalization capability.
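One way to see this sensitivity directly is to train the same kind of model on many independently drawn training sets and measure how much its prediction at a fixed input moves around. The sketch below does this for a simple and a flexible polynomial. The target function, the noise level, the query point `x0`, and the helper name `predictions_at` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_fn(x):
    """Assumed underlying relationship for this example."""
    return np.sin(3 * x)

def predictions_at(x0, degree, n_repeats=200, n_train=20, noise=0.3):
    """Fit one polynomial per freshly sampled training set; return each prediction at x0."""
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        x = rng.uniform(-1, 1, size=n_train)
        y = true_fn(x) + rng.normal(scale=noise, size=n_train)
        coeffs = np.polyfit(x, y, deg=degree)
        preds[i] = np.polyval(coeffs, x0)
    return preds

x0 = 0.9
var_simple = predictions_at(x0, degree=1).var()   # rigid model: predictions barely move
var_complex = predictions_at(x0, degree=9).var()  # flexible model: predictions swing widely

print(f"variance of predictions at x0 (degree 1): {var_simple:.4f}")
print(f"variance of predictions at x0 (degree 9): {var_complex:.4f}")
```

The spread of the flexible model's predictions across training sets is what the term "high variance" refers to.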

Low Variance Error in Machine Learning Models

With low variance, a model’s predictions stay stable when the training data changes, which supports generalization to new data points. Low-variance models tend to be simple and capture the primary patterns in the data rather than its noise. When bias is also low, such a model gives good and accurate predictions on unseen datasets. For example, a simple linear regression model exhibits low variance, and if the data really follows a linear relationship, it will perform well on both the training and test datasets.
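The linear regression case can be sketched as follows. The data is synthetic and generated from a known line (slope 2, intercept 1, both assumed for this example), so a straight-line fit is the right model and should score similarly on both splits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): a linear relationship plus noise.
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)

x_train, x_test = x[:140], x[140:]
y_train, y_test = y[:140], y[140:]

# Fit a straight line to the training split.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

train_mse = np.mean((slope * x_train + intercept - y_train) ** 2)
test_mse = np.mean((slope * x_test + intercept - y_test) ** 2)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

Both errors stay close to the noise level in the data, with no large gap between training and test performance.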

Mathematical Formula of Variance

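σ^2 (variance of dataset) = Σ (xi – μ)^2 / n. Here, xi is a single data point from the dataset, μ is the mean of the dataset, and n is the number of data points in the dataset. This is the statistical variance of a set of values; the variance of a model applies the same idea to the predictions the model makes when it is trained on different datasets.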
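The formula translates directly into NumPy. The sample values below are arbitrary and chosen only so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Arbitrary example data; its mean is 5.0.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = data.mean()                                   # μ, the mean of the dataset
variance = np.sum((data - mu) ** 2) / data.size    # Σ (xi – μ)^2 / n

print(mu, variance)
print(np.var(data))  # NumPy's built-in divides by n by default, so it matches
```

The squared deviations are 9, 1, 1, 1, 0, 0, 4 and 16, which sum to 32, so the variance is 32 / 8 = 4.0.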

Difference Between Bias and Variance

In machine learning, two distinct types of error can significantly impact the performance of a model: bias and variance. They have different characteristics. Bias is the result of a model making overly simple assumptions about the data, which leads to underfitting. Variance is the result of a model being overly sensitive to fluctuations in the training data, which leads to overfitting and a reduced ability to generalize.
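For squared-error loss, the two error sources are commonly combined in the decomposition below. The symbols are assumptions introduced here rather than notation from this article: f is the true function, \hat{f} is the model learned from a randomly drawn training set, and \sigma_\varepsilon^2 is the variance of the irreducible noise in y.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \sigma_\varepsilon^2
```

The expectation is taken over training sets and noise. The first term measures how far the average prediction is from the truth, and the second measures how much individual predictions scatter around that average.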

The objective in machine learning is to strike the right balance between bias and variance. The aim is to construct a model that generalizes to new data while accurately capturing the patterns within the training data. This balance is commonly known as the “bias-variance trade-off.” The best model has both low bias and low variance, although in practice reducing one tends to increase the other. Let’s see a simple implementation to understand this concept.

Examples of Bias and Variance

In the given scenario, there is a dataset comprising two features. The objective is to classify the data into two distinct classes by utilizing a Support Vector Machine (SVM) classifier. To demonstrate this task, we will generate an artificial dataset using Scikit-learn’s make_classification function.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate a synthetic two-feature, two-class dataset and hold out 30% for testing
X, y = make_classification(n_samples=100, n_features=2, n_informative=2, n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a linear SVM on the training split
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

# Predict on both splits
y_train_pred = svm_model.predict(X_train)
y_test_pred = svm_model.predict(X_test)

# Compare training and test accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print("Training Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)

In this example, a linear SVM classifier is used to separate the two classes in the dataset. Now, let’s explore an instance of bias by introducing class imbalance into the dataset. One class will be deliberately made more prevalent than the other.

In this second example, we add 30 more samples of class 0, duplicated from the existing ones, to make it more prevalent than class 1, creating a class imbalance. When we run the code, the test accuracy of the imbalanced model may come out higher than the original.

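# Indices of the samples belonging to each class
class0_indices = np.where(y == 0)[0]
class1_indices = np.where(y == 1)[0]

# Duplicate 30 randomly chosen class 0 samples; a seeded generator keeps the run reproducible
num_samples_to_add = 30
rng = np.random.default_rng(42)
additional_samples_indices = rng.choice(class0_indices, size=num_samples_to_add, replace=False)
X_imbalanced = np.vstack((X, X[additional_samples_indices]))
y_imbalanced = np.hstack((y, y[additional_samples_indices]))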

X_train_imbalanced, X_test_imbalanced, y_train_imbalanced, y_test_imbalanced = train_test_split(
    X_imbalanced, y_imbalanced, test_size=0.3, random_state=42)
svm_model_imbalanced = SVC(kernel='linear')
svm_model_imbalanced.fit(X_train_imbalanced, y_train_imbalanced)

y_train_pred_imbalanced = svm_model_imbalanced.predict(X_train_imbalanced)
y_test_pred_imbalanced = svm_model_imbalanced.predict(X_test_imbalanced)

train_accuracy_imbalanced = accuracy_score(y_train_imbalanced, y_train_pred_imbalanced)
test_accuracy_imbalanced = accuracy_score(y_test_imbalanced, y_test_pred_imbalanced)

print("Training Accuracy (Imbalanced):", train_accuracy_imbalanced)
print("Test Accuracy (Imbalanced):", test_accuracy_imbalanced)

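The bias introduced here shows up as an apparent performance advantage for the imbalanced model on the test data, which is a result of the artificial class imbalance. Because the extra class 0 samples are duplicates added before the split, some of them can land in both the training and test sets, which can inflate the test accuracy further. The model shows a preference for the majority class (class 0) and may not generalize to real-world situations with balanced class distributions.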
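A plain accuracy score can hide a preference for the majority class, so it helps to look at each class separately. The sketch below is one possible way to do that, not the article's own method: it splits the data before oversampling so that no duplicated row reaches the test set, then reports a confusion matrix, per-class recall, and balanced accuracy. The variable names and the choice of a stratified split are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Split first, so that no duplicated row can appear in both train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Oversample class 0 in the training set only (30 extra copies).
rng = np.random.default_rng(42)
class0_idx = np.where(y_train == 0)[0]
extra = rng.choice(class0_idx, size=30, replace=False)
X_train_imb = np.vstack((X_train, X_train[extra]))
y_train_imb = np.hstack((y_train, y_train[extra]))

model = SVC(kernel='linear')
model.fit(X_train_imb, y_train_imb)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
per_class_recall = cm.diagonal() / cm.sum(axis=1)
balanced_acc = balanced_accuracy_score(y_test, y_pred)

print("Confusion matrix:\n", cm)
print("Recall per class:", per_class_recall)
print("Balanced accuracy:", balanced_acc)
```

If the recall for class 1 is noticeably lower than for class 0, the model is favoring the majority class even when overall accuracy looks acceptable.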

Bias And Variance Example For Imbalanced Class

Bias can appear in far more diverse ways than this simple class imbalance example suggests. In real-world scenarios it stems from many sources, as discussed earlier. To ensure fairness and accuracy in predictions, it is crucial to address bias within machine learning models.

Summary

This article explained the concepts of bias and variance. What matters most in a machine learning model is the accuracy of its predictions on new data, and managing the bias-variance tradeoff is how that accuracy is maintained. We covered high and low bias, high and low variance, and two examples: one on the original dataset and one on a dataset with an artificial class imbalance. Together they show why bias and variance matter in machine learning models. We hope you enjoyed this article.
