Select the Best Machine Learning Model Features with Python

Feature Selection

Every machine learning enthusiast has a frustrating time deciding which features to use to make their models effective. Bad features can degrade a model's performance and lead to misleading results. So how do we tackle this problem?

Feature selection is your answer. It is one of the most essential steps in building machine learning models: choosing good features gives you much more accurate and efficient models.

In this article, we will learn what feature selection is, get a brief introduction to Sklearn, and then see how to perform feature selection using Sklearn's SelectKBest class.

Recommended: Feature Engineering in Machine Learning

Why Feature Selection Matters

As mentioned earlier, feature selection is about choosing the right inputs for your model. One major advantage is that it reduces overfitting: an overfit model predicts the training data well but is inaccurate on new data points. Another advantage is a simpler model, since fewer features mean fewer parameters to fit. A reduced feature set also makes the model more efficient and lowers its computational cost.

With SelectKBest we will apply the chi-squared test. The chi-squared test is simple and determines whether there is a significant relationship between two variables. In the context of feature selection, a chi-squared statistic is computed between the target variable and each independent variable. The test is performed for every feature, and a higher chi-squared score indicates that the feature is a better candidate for our machine learning model. Note that Sklearn's chi2 implementation requires non-negative feature values.
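To see what the chi-squared scoring looks like on its own, here is a minimal sketch using Sklearn's chi2 function directly on a tiny made-up dataset (the numbers below are purely illustrative, not from the article's example):

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy data: 6 samples, 2 non-negative features, binary target.
# Feature 0 is low for class 0 and high for class 1 (and vice versa
# for feature 1), so both should get a clear chi-squared signal.
X = np.array([[1, 9],
              [2, 8],
              [1, 7],
              [8, 1],
              [9, 2],
              [7, 2]])
y = np.array([0, 0, 0, 1, 1, 1])

# chi2 returns one (score, p-value) pair per feature
scores, p_values = chi2(X, y)
print("Chi-squared scores:", scores)
print("p-values:", p_values)
```

Higher scores (and lower p-values) indicate a stronger relationship between that feature and the target, which is exactly what SelectKBest uses to rank features.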

Chi-Square Test

Let us move on and learn a little about the Sklearn library.

Introducing Scikit-Learn

Scikit-learn, or Sklearn, is a machine learning library for the Python programming language. It helps us build different types of machine learning models, such as linear and logistic regression, and its modules simplify the computation of these models tremendously.
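As a quick taste of Sklearn's style, here is a minimal sketch that fits a logistic regression model on the built-in Iris dataset (this example is for illustration only and is not part of the feature selection workflow below):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a logistic regression classifier and evaluate it
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```

Most Sklearn estimators follow this same fit/predict/score pattern, which is what makes the library so convenient.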

Finally, let us move on and understand how feature selection is done using the SelectKBest class of the Sklearn library.

Selecting Best Features with SelectKBest

Here, we use SelectKBest from Sklearn for feature selection. We create 5 sample features: Temperature, Humidity, Wind Speed, Pressure, and Cloud Cover, and then select 3 of the 5 using the chi-squared selector. Let us see the code to understand it further.

The code first generates some random sample data to use for demonstration. The np.random.rand function creates a 10×5 array of random floats between 0 and 1 for the feature data (X). Then a random target array (y) is created with 10 values of either 0 or 1 using np.random.randint. This represents generic randomized feature data with a binary target, as you would commonly have in a machine learning classification task. We use this fabricated data just for illustration; in practice, you would use your actual training data features and targets.

import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Sample weather data (replace with your actual data)
X = np.random.rand(10, 5)  # 10 samples, 5 features
y = np.random.randint(0, 2, size=10)  # Target variable (0 or 1)

# Feature names (replace with meaningful names based on your data)
feature_names = ["Temperature", "Humidity", "Wind Speed", "Pressure", "Cloud Cover"]

# Create a chi-squared selector object to select top 3 features
selector = SelectKBest(chi2, k=3)

# Fit the selector on the data
X_new = selector.fit_transform(X, y)

# Get the names of the selected features
selected_features = selector.get_feature_names_out(feature_names)

print("Original features:", feature_names)
print("Selected features:", selected_features)

# Print the chi-squared scores for all features
print("Chi-squared scores:", selector.scores_)

Let us look at the output of the code below.

Feature Selection Using SelectKBest

Hence, we can see that in this run, Humidity, Pressure, and Cloud Cover were selected as the features because they have the highest chi-squared scores. (Since the data is randomly generated, the features selected in your run may differ.)
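In practice, SelectKBest is often combined with a model inside a Pipeline, so the same three features are selected and used consistently at both training and prediction time. Here is a minimal sketch of that pattern, again on fabricated random data:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

# Fabricated non-negative features and a binary target, seeded
# so the run is reproducible
rng = np.random.default_rng(42)
X = rng.random((100, 5))
y = rng.integers(0, 2, size=100)

# Chain feature selection and classification into one estimator
pipe = Pipeline([
    ("select", SelectKBest(chi2, k=3)),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# get_support() shows which of the 5 features the selector kept
print("Features kept:", pipe.named_steps["select"].get_support())
```

The advantage of this design is that calling pipe.predict on new data automatically applies the same feature selection before classification, avoiding train/predict mismatches.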

Also read: 5 Machine Learning Models with Python Examples

Conclusion

Here you go! Now you know how to select features for your machine learning models to improve their accuracy and make them much more efficient. In this article, we learned about the concept of feature selection and how the chi-squared test is used, and then we coded a simple feature selection program using the SelectKBest class of the Sklearn library.

Hope you enjoyed it!

Recommended: Supervised Machine Learning With Python: How To Get Started!