Feature Selection in Python - A Beginner's Reference

This article is a little on the advanced side. We’ll discuss feature selection in Python for training machine learning models. The goal is to identify the features in a dataset that matter most and to eliminate the ones that don’t improve model accuracy.

Model performance can be harmed by features that are irrelevant or only partially relevant. The first and most critical phase in model design should be feature selection and data cleaning.

Feature selection is a fundamental concept in machine learning that has a significant impact on your model’s performance. In this article, you’ll learn how to employ feature selection strategies in Machine Learning.

Let’s get started!

First of all, let us understand what feature selection is.

What is Feature Selection?

Irrelevant features in your data can reduce model accuracy, because the model may learn patterns from them that do not generalize. Feature selection is the process of selecting, either automatically or manually, the features that contribute the most to the prediction variable or output that you are interested in.

Why should we perform Feature Selection on our Model?

Following are some of the benefits of performing feature selection on a machine learning model:

Improved Model Accuracy: Model accuracy can improve because there is less misleading data.
Reduced Overfitting: With less redundant data, there is less chance of drawing conclusions based on noise.
Reduced Training Time: With fewer features, each sample has fewer dimensions to process, so algorithms train faster.

When feature selection removes genuinely irrelevant features, accuracy often improves, though the size of the gain depends on the dataset and the model.
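
One way to check whether feature selection actually helps on a given problem is to compare cross-validated accuracy with and without it. The sketch below is only an illustration: the synthetic dataset, the logistic regression model, and the choice of k=5 are all assumptions, not part of the mobile price example used later in this article.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 5 informative features
# mixed with 25 pure-noise features.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0,
    random_state=0,
)

# Baseline: every feature goes into the model.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Same model, but keeping only the 5 highest-scoring features.
# The selector sits inside the pipeline, so it is refit on each training
# fold and the held-out fold never influences which features are chosen.
selected = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=5),
    LogisticRegression(max_iter=1000),
)

baseline_scores = cross_val_score(baseline, X, y, cv=5)
selected_scores = cross_val_score(selected, X, y, cv=5)

print("All features:   %.3f" % baseline_scores.mean())
print("Top 5 features: %.3f" % selected_scores.mean())
```

The two means may be close or far apart depending on the data, which is the point of measuring rather than assuming an improvement.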

Methods to perform Feature Selection

There are three commonly used Feature Selection Methods that are easy to perform and yield good results.

Univariate Selection
Feature Importance
Correlation Matrix with Heatmap

Let’s take a closer look at each of these methods with an example.

Link to download the dataset: https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv

1. Univariate Selection

Statistical tests can be performed to identify which attributes have the strongest link to the output variable. The SelectKBest class in the scikit-learn library can be used with a variety of statistical tests to choose a certain number of features.

The example below uses the chi-squared (chi2) statistical test, which requires non-negative feature values, to select the 10 best features from the Mobile Price Range Prediction Dataset.

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
data = pd.read_csv("C://Users//Intel//Documents//mobile_price_train.csv")
X = data.iloc[:,0:20]  #independent variable columns
y = data.iloc[:,-1]    #target variable column (price range)

#extracting top 10 best features by applying SelectKBest class
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

#concat two dataframes
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score']  #naming the dataframe columns
print(featureScores.nlargest(10,'Score'))  #printing 10 best features

Output:

            Specs          Score
13            ram  931267.519053
11      px_height   17363.569536
0   battery_power   14129.866576
12       px_width    9810.586750
8       mobile_wt      95.972863
6      int_memory      89.839124
15           sc_w      16.480319
16      talk_time      13.236400
4              fc      10.135166
14           sc_h       9.614878
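
The example above only prints the scores. To actually keep the selected columns, SelectKBest also offers fit_transform and get_support. The following is a minimal sketch on synthetic data (the dataset and the feat_0 ... feat_7 column names are made up for illustration). It uses f_classif instead of chi2 because the synthetic features contain negative values, which chi2 does not accept.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical dataset: 8 features, 3 of which carry signal.
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=42,
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# Score every feature with the ANOVA F-test and keep the best 3.
selector = SelectKBest(score_func=f_classif, k=3)
X_reduced = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns,
# which lets us recover the names of the features that were kept.
selected_columns = X.columns[selector.get_support()].tolist()

print(selected_columns)
print(X_reduced.shape)  # (300, 3)
```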

2. Feature Importance

The feature_importances_ attribute of a fitted model can be used to obtain the importance of each feature in your dataset.

Feature importance assigns a score to each of your data’s features; the higher the score, the more important or relevant the feature is to your output variable. We will use the Extra Trees Classifier in the example below to extract the top 10 features for the dataset, because feature importance is a built-in attribute of tree-based classifiers.

import pandas as pd
import numpy as np
data = pd.read_csv("C://Users//Intel//Documents//mobile_price_train.csv")
X = data.iloc[:,0:20]  #independent variable columns
y = data.iloc[:,-1]    #target variable column (price range)
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X,y)
print(model.feature_importances_) 

#plot the graph of feature importances 
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

Output (exact values vary between runs because the trees are built with random splits):

[0.05945479 0.02001093 0.03442302 0.0202319  0.03345326 0.01807593
 0.03747275 0.03450839 0.03801611 0.0335925  0.03590059 0.04702123
 0.04795976 0.38014236 0.03565894 0.03548119 0.03506038 0.01391338
 0.01895962 0.02066298]
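
The raw array is hard to read without the column names. A small sketch of pairing the scores with names and sorting them is shown below. It runs on synthetic data with made-up column names, and it fixes random_state (an assumption added here, not in the example above) so that repeated runs give the same ranking.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Hypothetical dataset: 8 features, 3 of which carry signal.
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0,
    random_state=42,
)
X = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(8)])

# random_state makes the randomized trees reproducible between runs.
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Attach column names to the scores and sort from most to least important.
importances = pd.Series(model.feature_importances_, index=X.columns)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

The importances are normalized, so they sum to 1 and can be read as each feature’s share of the total.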

3. Correlation Matrix with Heatmap

Correlation measures how strongly two variables move together. Here we are interested in the correlation between each feature and the target variable.
Correlation can be:

Positive: An increase in the feature’s value goes with an increase in the value of the target variable.
Negative: An increase in the feature’s value goes with a decrease in the value of the target variable.
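
The two cases can be seen on a tiny hand-made example. The series below are invented for illustration; both are exact linear functions of x, so the Pearson correlation sits at the two extremes.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
rises_with_x = 2 * x + 1    # grows as x grows
falls_with_x = 10 - 3 * x   # shrinks as x grows

# Series.corr computes the Pearson correlation by default.
print(x.corr(rises_with_x))  # approximately 1.0 (perfect positive)
print(x.corr(falls_with_x))  # approximately -1.0 (perfect negative)
```

Real features rarely reach these extremes; values near 0 indicate little or no linear relationship.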

We will plot a heatmap of correlated features using the Seaborn library to find which features are most connected to the target variable.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  #needed for plt.figure and plt.show
import seaborn as sns
data = pd.read_csv("C://Users//Intel//Documents//mobile_price_train.csv")
X = data.iloc[:,0:20]  #independent variable columns
y = data.iloc[:,-1]    #target variable column (price range)

#obtain the pairwise correlations of all columns, including the target
corrmat = data.corr()
plt.figure(figsize=(20,20))
#plot heat map of the correlation matrix
g=sns.heatmap(corrmat,annot=True,cmap="RdYlGn")
plt.show()

Output:

Look at the last row of the heatmap, which corresponds to the price range. It shows how each feature correlates with the price range. ‘ram’ is the feature most highly correlated with the price range, followed by features such as battery power, pixel height, and pixel width. m_dep, clock_speed, and n_cores are the features least correlated with the price range.
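
Reading values off a large heatmap can be tedious. The same ranking can be obtained as a sorted list by taking the target’s column from the correlation matrix. The sketch below uses a small synthetic DataFrame; the column names strong, weak, noise, and target are hypothetical and stand in for the real dataset’s features and price range.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the target depends heavily on "strong",
# slightly on "weak", and not at all on "noise".
rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
weak = rng.normal(size=n)
noise = rng.normal(size=n)
data = pd.DataFrame({
    "strong": strong,
    "weak": weak,
    "noise": noise,
    "target": 3 * strong + 0.3 * weak + rng.normal(scale=0.5, size=n),
})

# Correlation of every feature with the target (the target itself is dropped).
target_corr = data.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not overlooked.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Keep in mind that this only captures linear relationships; a feature with a low correlation can still be useful to a non-linear model.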

Conclusion

In this article, we learned how to choose relevant features from data using univariate selection, feature importance, and the correlation matrix. Choose the method that best suits your case and use it to improve your model’s accuracy.