So, let us get started!! 🙂
Regression vs Classification in Machine Learning – Introduction
When we think of data science and analysis, Machine Learning has been playing an important role in modeling the data for predictions and analysis.
Machine Learning provides us with various algorithms that help model the data over the provided training and test dataset. There are two types of machine learning algorithms:
- Supervised Machine Learning algorithms: These algorithms work on labelled data and learn from the historic data fed to it, builds the model over it and then this model can be used for future predictions on the test data.
- Unsupervised Machine Learning algorithms: These algorithms unlike Supervised Learning algorithms do not learn from the historic data. Rather, they identify similar pattern/characteristics from live data and group them together as a category.
Talking specifically about Supervised machine Learning algorithms, they are further subdivided into the below types of algorithms:
- Regression: These algorithms work on the numeric data values. They perform predictions on the data set where the dependent variable or the target variable is a numeric data variable. Thus, the outcome of the prediction is also a numeric/continuous data variable. Some of the mostly used Regression ML Algorithms are Linear Regression, Decision Tree Regressor, Support Vector Regressor, etc.
- Classification: These kind of algorithms work on categorical data values that is the data variables that possess categorical data. It makes predictions on the dataset that happens to have a categorical dependent/target variable. Mostly used Classification ML Algorithms are Naïve Bayes, Logistic Regression, KNN, etc.
Having understood Regression and Classification in Supervised ML now will be discussing the key differences between them in the upcoming section.
As discussed above, Regression algorithms try to map continuous target variables to the various input variables from the dataset. It helps us predict the continuous integrated score/value for the requested calculations around the best fit line.
When we run any regression algorithm to evaluate the model, it is essential to have variants of solutions through which we can evaluate the credibility of the solution for the continuous prediction of numeric values.
Solution 01: VARIANCE
With Regression, the target data variable has a connection established with the independent variables. Variance enables us to test the change in the estimation of the target data variable with any kind of change in the training data variables from the partitioned dataset.
Usually, for any training data value, the ideal outcome of the model should give the same results. That is it should exhibit a minimum variance score. Thus for any regression model/algorithm, we make sure that the variance score is as low as possible.
Solution 02: BIAS
In simple language, Bias represents the possibility of the regression algorithm to adapt and learn the incorrect data values without even taking all the data into consideration. For any model to have better results, it is essential for them to have a low bias score. Usually, bias has a high value when the data has missing values or outliers in the dataset.
In the end, when it comes to regression algorithms the entire scenario is surrounded around the concept of the best fit line. Yes, the regression models try to fit the line between the predictions and actual data scores.
As discussed above, Classification type algorithms enable us to work on the categorical types of data values at ease. We predict a label of class from various sets of class (data variables).
With reference to classification, there exist various types of Classification tasks some of which are mentioned below-
- Binary Classification – In these type of scenarios, the dataset contains the dependent variables to have two labels. That is the classification model gets tested against two categorical labels. For example, a recommendation system to check if the emails are SPAM or NOT SPAM, a portal to check if the student with particular ID is PRESENT or ABSENT, etc. We can make use of Logistic Regression, Decision Trees, etc to solve binary classification problems.
- Multi-Class Classification – As the name suggests, a multi-class classification algorithm contains datasets with more than two categorical labels as the dependent variable. Unlike Binary Classification, here the labels are not binary rather they belong to a range of labels expected. For example, Animal or plant species recognition, human face classification based on more than two attributes, etc.
- Imbalanced Classification – In this type of classification, the count of examples that belong to every category or class label is unequally distributed. For example, consider a medical diagnosis dataset which contains data of people diagnosed with malaria v/s people unaffected with it. In this scenario, consider more than 80% training data contains elements that states people have malaria. This scenario or type of classification problem is known as Imbalance classification problem. Here, there is an unequal difference between the types of labels. We can make use of SMOTE or Random Oversampling to solve such type of problems.
Difference 1: Behavior of the resultant value
Once we are done with the predictions, for the Regression type of data, the prediction results are continuous in nature. That is, the data values predicted are numeric in nature.
On the other hand, post predictions, the type of the resultant for Classification algorithms is categorical in nature. They result in some groups or categories.
Difference 2: Evaluation(Error estimation) of the model
Post prediction, it is essential for us to apply certain metrics to check for the accuracy of the model.
For the same, with Regression Algorithms, we make use of MAPE, R-square, etc to measure the error estimation of the model. On the other hand, for classification algorithms, we mostly make use of Recall, Confusion Matrix, F-1 score, etc to estimate the accuracy of the model.
Difference 3: Method of Prediction
For the prediction of the data values against the historic data, Regression algorithms make use of the best fit line to estimate and predict the closest continuous data value for the data set.
The classification algorithms use decision boundaries to detect the boundary of the cluster formed as a combination of points with similar characteristics. This helps in identifying the input data against different categories.
By this, we have come to the end of this topic. Feel free to comment below, in case you come across any questions.
For more such posts related to Python programming, Stay tuned with us.
Till then, Happy Learning!! 🙂