This article is a little on the advanced side. We’ll discuss feature selection in Python for training machine learning models. The goal is to identify the features in a dataset that matter most and to eliminate those that don’t improve model accuracy.
Model performance can be harmed by features that are irrelevant or only partially relevant. The first and most critical phase in model design should be feature selection and data cleaning.
Feature selection is a fundamental concept in machine learning that has a significant impact on your model’s performance. In this article, you’ll learn how to apply feature selection strategies in practice.
Let’s get started!
First, let us understand what feature selection is.
What is Feature Selection?
Irrelevant features in your data can reduce model accuracy, because the model ends up learning from patterns that have nothing to do with the target. Feature selection is the process of selecting, either automatically or manually, the features that contribute the most to the prediction variable or output that you are interested in.
Why should we perform Feature Selection on our Model?
The following are some of the benefits of performing feature selection on a machine learning model:
- Improved Model Accuracy: Model accuracy improves as a result of less misleading data.
- Reduced Overfitting: With less redundant data, there is less chance of drawing conclusions based on noise.
- Reduced Training Time: With fewer features, each sample has fewer dimensions, so algorithms have less work to do and train faster.
When you conduct feature selection on a model, its accuracy can improve noticeably, although the size of the gain depends on the dataset and the model.
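One way to check whether feature selection actually helps is to compare cross-validated accuracy with and without it. The sketch below is only an illustration under stated assumptions: it uses synthetic data from make_classification rather than the mobile price dataset, a logistic regression model chosen for convenience, and the f_classif score function instead of chi2 because the scaled features can be negative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the article's dataset):
# 30 features, of which only 5 carry signal.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5,
    n_redundant=0, random_state=0,
)

# Baseline: every feature goes into the model.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Same model, but keep only the 5 highest-scoring features. Putting the
# selector inside the pipeline refits it on each training fold, so the
# held-out fold never influences which features are chosen.
selected = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=5),
    LogisticRegression(max_iter=1000),
)

baseline_scores = cross_val_score(baseline, X, y, cv=5)
selected_scores = cross_val_score(selected, X, y, cv=5)

print(f"all features : {baseline_scores.mean():.3f}")
print(f"top 5 only   : {selected_scores.mean():.3f}")
```

Whether the second number comes out higher depends on the data and the model, so it is worth running this kind of comparison on your own dataset before dropping any features.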
Methods to perform Feature Selection
There are three commonly used Feature Selection Methods that are easy to perform and yield good results.
- Univariate Selection
- Feature Importance
- Correlation Matrix with Heatmap
Let’s take a closer look at each of these methods with an example.
Link to download the dataset: https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv
1. Univariate Selection
Statistical tests can be performed to identify which attributes have the strongest link to the output variable. The SelectKBest class in the scikit-learn library can be used with a variety of statistical tests to choose a certain number of features.
The example below uses the chi-squared (chi2) statistical test, which requires non-negative feature values, to select the 10 best features from the Mobile Price Range Prediction Dataset.
    import pandas as pd
    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2

    data = pd.read_csv("C://Users//Intel//Documents//mobile_price_train.csv")
    X = data.iloc[:,0:20]  #independent variable columns
    y = data.iloc[:,-1]    #target variable column (price range)

    #extracting top 10 best features by applying SelectKBest class
    bestfeatures = SelectKBest(score_func=chi2, k=10)
    fit = bestfeatures.fit(X,y)
    dfscores = pd.DataFrame(fit.scores_)
    dfcolumns = pd.DataFrame(X.columns)

    #concat two dataframes
    featureScores = pd.concat([dfcolumns,dfscores],axis=1)
    featureScores.columns = ['Specs','Score']  #naming the dataframe columns
    print(featureScores.nlargest(10,'Score'))  #printing 10 best features

Output:

                Specs          Score
    13            ram  931267.519053
    11      px_height   17363.569536
    0   battery_power   14129.866576
    12       px_width    9810.586750
    8       mobile_wt      95.972863
    6      int_memory      89.839124
    15           sc_w      16.480319
    16      talk_time      13.236400
    4              fc      10.135166
    14           sc_h       9.614878
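The scores above rank the features, but the fitted selector can also reduce the dataset for you. The following is a minimal sketch, assuming a small synthetic DataFrame in place of the real file; the column names and the way the values are generated are hypothetical. It uses the get_support() and transform() methods of SelectKBest.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the mobile price data (hypothetical columns).
rng = np.random.default_rng(0)
n = 400
y = rng.integers(0, 4, size=n)  # four price classes: 0, 1, 2, 3
demo = pd.DataFrame({
    "ram": 500 + 900 * y + rng.integers(0, 300, n),            # strongly tied to y
    "battery_power": 800 + 150 * y + rng.integers(0, 600, n),  # moderately tied to y
    "noise_a": rng.integers(0, 20, n),                         # unrelated to y
    "noise_b": rng.integers(0, 20, n),                         # unrelated to y
})

# Keep the 2 features with the highest chi-squared score.
selector = SelectKBest(score_func=chi2, k=2).fit(demo, y)

# get_support() returns a boolean mask over the input columns.
kept = demo.columns[selector.get_support()].tolist()

# transform() drops the unselected columns and returns a NumPy array.
X_reduced = selector.transform(demo)
print(kept, X_reduced.shape)
```

With the real dataset you would pass X and y from the example above instead of demo and the synthetic y, and set k to the number of features you want to keep.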
2. Feature Importance
Tree-based models expose a feature_importances_ attribute that can be used to obtain the importance of each feature in your dataset.
Feature importance assigns a score to each of your data’s features; the higher the score, the more important or relevant the feature is to your output variable. We will use the Extra Trees Classifier in the example below to extract the top 10 features for the dataset, because feature importance comes built into tree-based classifiers as the feature_importances_ attribute of the fitted model.
    import pandas as pd
    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    import matplotlib.pyplot as plt

    data = pd.read_csv("C://Users//Intel//Documents//mobile_price_train.csv")
    X = data.iloc[:,0:20]  #independent variable columns
    y = data.iloc[:,-1]    #target variable column (price range)

    model = ExtraTreesClassifier()
    model.fit(X,y)
    print(model.feature_importances_)

    #plot the graph of feature importances
    feat_importances = pd.Series(model.feature_importances_, index=X.columns)
    feat_importances.nlargest(10).plot(kind='barh')
    plt.show()
Output:

    [0.05945479 0.02001093 0.03442302 0.0202319  0.03345326 0.01807593
     0.03747275 0.03450839 0.03801611 0.0335925  0.03590059 0.04702123
     0.04795976 0.38014236 0.03565894 0.03548119 0.03506038 0.01391338
     0.01895962 0.02066298]
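Because the Extra Trees Classifier builds randomized trees, the printed importances change slightly from run to run unless a seed is fixed. The sketch below shows one way to make the ranking repeatable and to pull out the top feature names. It assumes synthetic data from make_classification, and the column names f0 to f7 are hypothetical. With shuffle=False the informative features are the first three columns by construction.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data; the column names f0..f7 are hypothetical.
X_arr, y = make_classification(
    n_samples=600, n_features=8, n_informative=3,
    n_redundant=0, shuffle=False, random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Fixing random_state makes the importances repeatable between runs.
model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Importances are normalised to sum to 1; sort them from high to low.
importances = pd.Series(model.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)

# Keep the names of the three highest-ranked features.
top3 = ranked.head(3).index.tolist()
print(top3)
```

The values of n_estimators and random_state here are arbitrary choices for illustration, not settings taken from the example above.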
3. Correlation Matrix with Heatmap
Correlation describes how strongly the features are related to each other and to the target variable.
Correlation can be:
- Positive: An increase in one feature’s value increases the value of the target variable.
- Negative: An increase in one feature’s value decreases the value of the target variable.
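The two cases can be seen on a toy example. This is a small sketch with made-up numbers, using NumPy’s corrcoef function to compute the Pearson correlation coefficient, which ranges from -1 to +1.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rising = 2.0 * x + 1.0    # grows as x grows
falling = 10.0 - 3.0 * x  # shrinks as x grows

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the Pearson coefficient.
r_pos = np.corrcoef(x, rising)[0, 1]
r_neg = np.corrcoef(x, falling)[0, 1]
print(round(r_pos, 3), round(r_neg, 3))
```

Both relationships here are perfectly linear, so the coefficients sit at the two extremes of the range. Real features usually land somewhere in between.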
We will plot a heatmap of correlated features using the Seaborn library to find which features are most connected to the target variable.
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt  #needed for plt.figure below

    data = pd.read_csv("C://Users//Intel//Documents//mobile_price_train.csv")
    X = data.iloc[:,0:20]  #independent variable columns
    y = data.iloc[:,-1]    #target variable column (price range)

    #obtain the correlations of each feature in the dataset
    corrmat = data.corr()
    top_corr_features = corrmat.index
    plt.figure(figsize=(20,20))

    #plot heat map
    g = sns.heatmap(data[top_corr_features].corr(),annot=True,cmap="RdYlGn")
    plt.show()
Look at the last row of the heatmap, which corresponds to the price range. It shows how strongly each feature is correlated with the price range. ‘ram’ is the feature most highly correlated with the price range, followed by features such as battery power, pixel height, and pixel width. m_dep, clock_speed, and n_cores are the features least correlated with the price range.
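Reading the last row of a 21 by 21 heatmap by eye can be tedious, so the same information can also be pulled out as a sorted list. The sketch below assumes a small synthetic DataFrame whose column names echo the dataset but whose values are made up. With the real data you would call data.corr() in the same way.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, familiar column names).
rng = np.random.default_rng(0)
n = 500
price_range = rng.integers(0, 4, size=n)
demo = pd.DataFrame({
    "ram": 900 * price_range + rng.normal(0, 200, n),            # strong link
    "battery_power": 100 * price_range + rng.normal(0, 300, n),  # weaker link
    "clock_speed": rng.normal(1.5, 0.8, n),                      # unrelated
    "price_range": price_range,
})

# Correlation of every column with the target, dropping the target itself.
target_corr = demo.corr()["price_range"].drop("price_range")

# Rank by absolute value so strong negative links are not overlooked.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Sorting by absolute value is a deliberate choice here: a feature with a correlation of -0.8 is just as informative as one with +0.8.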
In this article, we learned how to choose relevant features from data using univariate selection, feature importance, and the correlation matrix. Choose the method that best suits your case and use it to improve your model’s accuracy.