If you are new to machine learning, you must have heard about feature engineering and thought, “What is feature engineering?” “It sure sounds like something complicated; I’ll save it for later.” But let me tell you, it only sounds complicated. It’s actually very simple, and it’ll take only a few minutes for you to completely understand it. So let’s get into it.
Before learning about feature engineering, we have to understand what machine learning is.
What is Machine Learning?
Machine learning has gained massive popularity in recent times. With the launch of powerful new AI tools like ChatGPT, machine learning has caught the eye of a lot of people. We have heard about machine learning for years, but for the first time, we are seeing it in action.
Machine learning is the process of training a model with the help of a lot of data, based on which the model will make predictions. This data is fed to the model in the form of a dataset. A dataset is a collection of data in the form of a table, made up of rows and columns. When dealing with big datasets, we have to make sure that our dataset contributes as much as possible to making accurate predictions.
Types of Data
In machine learning, data is mainly divided into two types:
Structured data
Data that is in a particular format that makes it easier to work with is called structured data. It usually has rows and columns. It is often stored in an RDBMS (Relational Database Management System) or in CSV files. It is easier to work with and can be very helpful in building a machine learning model.
Unstructured data
Unstructured data is raw data that hasn’t been processed. It can contain audio files, text data, or image data. It doesn’t have a pre-defined format. It needs to be converted into structured data before we can start building our model on it; it is not possible to build a machine learning model directly on unstructured data. To convert unstructured data into structured data, feature engineering is used.
What is Feature Engineering?
Let’s assume we have some independent variables in our dataset. These variables are also known as “features.” Now we can generate new features based on these existing features in our dataset. This process of generating new features from the ones that we already have is called feature engineering.
In formal language, “feature engineering” is the science of extracting more useful information from the data already present in our dataset.
Why do we need Feature Engineering?
The dataset in its raw form may not always be appropriate for training our model. So we need to improve our dataset so our model can achieve maximum accuracy and give the best possible results. Feature engineering makes this possible: it helps you make your dataset more suitable for training your ML model.
Before starting with feature engineering, we have to ensure that the features currently present in the dataset are all necessary for our model. Some of the features may be redundant, and some may even impact the prediction capabilities of our model in a negative way.
How to perform Feature Engineering?

Feature Creation
When you create new features by recognizing patterns in pre-existing features, it is known as feature creation. It is an important aspect of feature engineering. A lot of times, the available data gives you hints about patterns. You can recognize these patterns and create new features based on them.
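For instance, here is a minimal sketch in pandas (the total_rooms and households columns are made-up examples) where a new rooms_per_household feature is created from two existing ones:

import pandas as pd

# Hypothetical housing data with two existing features
df = pd.DataFrame({
    "total_rooms": [1200, 850, 2400],
    "households": [300, 200, 480],
})

# Create a new feature from the existing ones
df["rooms_per_household"] = df["total_rooms"] / df["households"]
print(df)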
Feature Selection
Feature selection is an essential part of feature engineering. It is the process where we check whether a feature is good or bad for our model. After checking, we exclude the features that are not good for our model from the dataset. Even if all the features are good for our model, sometimes we want only the best few. This may be due to time constraints or to avoid overfitting the model.
Related: Learn how to deal with overfitting.
Feature selection can be done in two ways:
Forward selection
In this method, we start with an empty feature set. We add a feature to it and evaluate the R2 score. If the R2 score increases, we keep the feature; otherwise, we remove it. We repeat this process for all the features.
Backward elimination
In this method, we start with all the features included. We remove a feature and evaluate the R2 score. If the R2 score increases, we remove the feature permanently; otherwise, we keep it. We repeat this process for all the features.
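Here is a minimal sketch of both approaches using scikit-learn’s SequentialFeatureSelector (the diabetes dataset and the LinearRegression estimator are just for illustration; the selector adds or removes features based on a cross-validated score, which is R2 by default for regression):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Small regression dataset used only for illustration
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression()

# Forward selection: start empty, add features that improve the score
forward = SequentialFeatureSelector(model, n_features_to_select=5, direction="forward")
forward.fit(X, y)
print("Forward selection kept:", list(X.columns[forward.get_support()]))

# Backward elimination: start with all features, drop the least useful ones
backward = SequentialFeatureSelector(model, n_features_to_select=5, direction="backward")
backward.fit(X, y)
print("Backward elimination kept:", list(X.columns[backward.get_support()]))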
Related: Feature Selection
Feature Scaling
If you have more than one variable, their value ranges will not necessarily be of equal magnitude; they may come from different domains. Because of this, our model may give more importance to a feature that actually carries very little importance, which will reduce the accuracy of the model. To solve this problem, we need to bring the data to the same scale. This is what feature scaling does. Feature scaling can be done using various methods, but normalization and standardization are the most common ways to scale data. You can learn more about them in the link mentioned below.
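As a minimal sketch, here is how standardization and normalization look with scikit-learn’s preprocessing module (the Age and Salary columns are hypothetical):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales
df = pd.DataFrame({"Age": [25, 32, 47, 51], "Salary": [40000, 52000, 95000, 120000]})

# Standardization: rescale each feature to mean 0 and standard deviation 1
df_standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Normalization (min-max scaling): rescale each feature to the [0, 1] range
df_normalized = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

print(df_standardized)
print(df_normalized)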
Related: Feature Scaling
Dealing with Missing data
Missing data in a dataset is generally represented by NA or null values. A lot of null values can lead the model in the wrong direction, so they need to be dealt with. We drop the columns that have a lot of null values. If a feature doesn’t have too many null values, we can replace them with the mean of the remaining values in the feature, or we can drop the affected records entirely. Dropping records is not preferred in some cases, as it leads to loss of data.
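A minimal pandas sketch of these options (the columns below are made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Marks": [99, np.nan, 23, 81, np.nan],
    "Attendance": [np.nan, np.nan, np.nan, np.nan, 60],  # mostly missing
})

# Drop a column that has too many null values
df = df.drop(columns=["Attendance"])

# Replace the remaining nulls with the mean of the feature
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())

# Alternatively, drop the affected records entirely: df = df.dropna()
print(df)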
Feature transformation
Feature transformation is a method to transform the features of the dataset using mathematical operations. Some models like linear models perform well when we have a normal distribution of data. In case of skewed data, we can use feature transformation to transform data and normalize it.
For right-skewed data, we can use an nth-root or log transformation, and for left-skewed data, we can use an nth-power or exponential transformation.
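For example, here is a small sketch of a log transformation with NumPy on a hypothetical right-skewed Income column (log1p is used so zero values don’t break the transform):

import numpy as np
import pandas as pd

# Hypothetical right-skewed feature
df = pd.DataFrame({"Income": [20000, 25000, 30000, 45000, 400000]})

# log1p computes log(1 + x), pulling in the long right tail
df["Income_log"] = np.log1p(df["Income"])

# For left-skewed data, a power transform such as squaring can be used instead
df["Income_squared"] = df["Income"] ** 2
print(df)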
Feature encoding
Many ML models, like linear regression, don’t accept categorical variables as independent variables. In those cases, we must transform the categorical variable into a numerical variable to make it useful. For that purpose, we use categorical encoding. Categorical encoding is an essential feature engineering method. There are mainly two ways in which we can perform categorical encoding.
One hot encoding / Dummy encoding
In dummy encoding, we create a separate variable for each category in the categorical variable. These new variables contain boolean values (0 or 1). For example, say we have a categorical variable “Result” consisting of two values, “Yes” or “No”. We will replace this variable with two variables named “Result_Yes” and “Result_No”. In an instance of the dataset where the value of “Result” was “Yes”, we put 1 in the “Result_Yes” variable and 0 in the “Result_No” variable. Now let’s see how we can do this in Python.
Dummy encoding implementation in Python
We just have to use the get_dummies function of pandas for this purpose.
import pandas as pd

# Marks of students and whether they qualified (marks >= 35)
data = {"Marks": [99, 31, 23, 81, 63], "Qualified": ["Yes", "No", "No", "Yes", "Yes"]}
df = pd.DataFrame(data)

# Replace the "Qualified" column with dummy variables
df = pd.get_dummies(df, columns=['Qualified'], drop_first=True)
df
In the above code, we first imported the pandas library. Then we created a dictionary with Marks and Qualified as keys, passed the marks, and set the values of Qualified as follows: records with marks >= 35 have a Qualified value of “Yes”, and the rest have “No”. Then we converted the dictionary to a dataframe. Finally, we used the get_dummies function of pandas to replace the “Qualified” variable with dummy variables.

So now you can see that the Qualified column has been removed. But there’s only one dummy variable – Qualified_Yes. Why?
Think about it. The Qualified_Yes variable has the value 1 wherever there was a Yes in the Qualified column. That means wherever there’s a No in Qualified, there must be a 0 in Qualified_Yes and a 1 in Qualified_No. So no information is lost; we can retrieve all the information previously present with the Qualified_Yes variable alone. Adding the Qualified_No variable would be unnecessary and redundant.
Label encoding
In label encoding, for every category in the variable, we assign a number and then replace the categories with that number. For example, we have a categorical variable “condition” taking the values “Excellent”, “Good”, “Okay”, and “Bad”. In this variable, there’s an order among the categories.
Excellent > Good > Okay > Bad
So, we can denote Excellent with 4, Good with 3, Okay with 2, and Bad with 1, and replace all the instances with their corresponding numeric values. Now we have a numeric column with values 1-4 instead of a categorical column.
Label encoding implementation in Python
Let’s create a dataframe using the pandas library. To create a dataframe, go through the following steps:
- Import pandas library.
- Create a dictionary with the keys as the name of the columns and values as a list of values in the corresponding column.
- Now use the DataFrame function from the pandas library to convert this dictionary into a dataframe.
import pandas as pd

# Marks of students and a Result label based on those marks
data = {"Marks": [99, 65, 73, 81, 63], "Result": ["excellent", "bad", "okay", "good", "bad"]}
df = pd.DataFrame(data)

In the above code block, we first imported the pandas library as pd. Then we created a dictionary with Marks and Result as keys, passed the marks, and set the values of Result as follows: records with marks >= 90 have a Result value of “excellent”, 90 > marks >= 80 have “good”, 80 > marks >= 70 have “okay”, and marks < 70 have “bad”. Now we have to perform label encoding on the “Result” variable. For this, we will use the map function of the dataframe.
# Map each category to its numeric rank
mapping = {"excellent": 4, "good": 3, "okay": 2, "bad": 1}

# Work on a copy so the original dataframe stays unchanged
df_encoded = df.copy()
df_encoded['Result'] = df['Result'].map(mapping)
df_encoded

For label encoding, we created a dictionary with each value of Result as a key and the numeric value it should be replaced with as the corresponding value. Then we used the map function on the Result variable and passed the mapping dictionary as the argument. This replaces each record in the Result column with the corresponding value from the dictionary.
Note: Only use label encoding when you know the order among different levels.
One hot encoding is an important part of feature engineering. We can’t cover everything right here. Check out one hot encoding in depth.
Binning
Consider a categorical variable that has 100 categories. If we try to create dummy variables for this categorical variable, there will be 100 new columns in the dataset. Before performing dummy encoding, we can reduce the number of categories using binning. This way, we reduce the number of dummy variables.
Note: We can also bin continuous variables but it is advised not to as it leads to loss of information.
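Here is a minimal sketch of binning with pandas (the City column and the choice of keeping the top two categories are assumptions): rare categories are grouped into an “Other” bin before dummy encoding, and pd.cut shows how a continuous variable could be binned if you really need to.

import pandas as pd

df = pd.DataFrame({"City": ["Delhi", "Mumbai", "Delhi", "Pune", "Agra", "Delhi", "Mumbai"],
                   "Marks": [99, 31, 23, 81, 63, 55, 72]})

# Keep only the most frequent categories and bin the rest into "Other"
top_cities = df["City"].value_counts().nlargest(2).index
df["City_binned"] = df["City"].where(df["City"].isin(top_cities), "Other")

# Binning a continuous variable (possible, but it loses information)
df["Marks_bin"] = pd.cut(df["Marks"], bins=[0, 35, 70, 100], labels=["Fail", "Average", "Good"])
print(df)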
Feature Generation
Feature generation is the most important part of feature engineering. We can use feature generation to great advantage. Let’s understand this with the help of an example.
Suppose you’re making a machine learning model for traffic prediction, and you have a date variable in the dataset. The date variable by itself doesn’t contribute much to our model. But we can create a new feature, “day”, from this date variable by finding out which day of the week each date falls on. This day variable is very useful: the model will be able to notice patterns in the traffic based on the day, as we know traffic varies by day.
For example, there will be less traffic on Sunday as it is a holiday. We just generated a new and useful feature from a pre-existing feature that wasn’t able to contribute much to our model on its own. This way, we can use feature generation to make our model better than before.
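Here is a minimal sketch of this date-to-day idea in pandas (the dates and traffic volumes are made up):

import pandas as pd

df = pd.DataFrame({"date": ["2023-01-01", "2023-01-02", "2023-01-07"],
                   "traffic_volume": [1200, 4500, 2100]})

# Convert the raw date strings into datetime objects
df["date"] = pd.to_datetime(df["date"])

# Generate a new "day" feature from the date
df["day"] = df["date"].dt.day_name()
print(df)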
Combination of features
The combination of features is the process of creating a new feature based on two or more different features, which can then replace the original ones. An efficient combination of two or more features can result in improved prediction power for our model.
For example, suppose we are making a machine learning model to predict the sale price of a house. The dataset has a waterfront view column and a backyard column. We can combine these two features into a new feature, luxury home: records with both waterfront view and backyard equal to 1 get a luxury home value of 1, records with only one of them equal to 1 get 0.5, and records with both equal to 0 get 0. Now we can remove the original columns. This way, we reduce the dimension of our dataset while keeping the information present.
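Here is a minimal sketch of this combination in pandas (the column names and the 1/0.5/0 scoring follow the example above, but the exact scheme is just an illustration):

import pandas as pd

df = pd.DataFrame({"waterfront_view": [1, 0, 1, 0],
                   "backyard":        [1, 1, 0, 0],
                   "price":           [900000, 400000, 550000, 300000]})

# 1 if both are present, 0.5 if only one is, 0 if neither
df["luxury_home"] = (df["waterfront_view"] + df["backyard"]) / 2

# The original columns can now be dropped to reduce dimensionality
df = df.drop(columns=["waterfront_view", "backyard"])
print(df)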
Feature Engineering for specific data types
Time series data
Time series data is data that is collected over time, for example, stock market prices or global warming rates. In this type of data, chronology is important. Consider the example of stock market prices: the data in such a dataset can’t just be stock prices on random dates, because stock prices on random dates won’t be of any use. We need to find patterns in how stock prices vary over time, so we need to maintain chronology. Data where chronology carries importance is called time series data.
Feature engineering in time series data is important for data analysis and modeling. Feature engineering in time series data includes wavelet transform, Fourier transform, rolling features, lag features, seasonal features, etc. There are libraries available in Python for feature engineering in time series data. tsfresh is the most commonly used. It contains functions to extract more than 1000 features.
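For instance, lag and rolling features can be built directly in pandas (the prices below are made up; a library like tsfresh automates far more than this):

import pandas as pd

df = pd.DataFrame({"date": pd.date_range("2023-01-01", periods=7),
                   "price": [100, 102, 101, 105, 107, 106, 110]})

# Lag feature: yesterday's price
df["price_lag_1"] = df["price"].shift(1)

# Rolling feature: 3-day moving average
df["price_roll_mean_3"] = df["price"].rolling(window=3).mean()
print(df)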
Audio data
Creating a machine learning model for audio data is intricate; working with audio data makes the model-creation process a lot more tedious. For audio data, it’s really important that you go through the feature engineering process.
Feature engineering on audio data includes creating pitch and rhythm features, spectral features, and spectrogram features, where we convert the audio signal into spectrogram images. You can use the pyAudioAnalysis or Librosa library for this purpose.
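As a rough sketch with Librosa (the file path is a placeholder), we can load an audio file and extract MFCCs, a common set of spectral features:

import librosa
import numpy as np

# Load an audio file (the path is just a placeholder)
y, sr = librosa.load("example.wav")

# Extract MFCCs, commonly used spectral features for audio models
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Summarize each coefficient over time so it can be used as a tabular feature
mfcc_means = np.mean(mfccs, axis=1)
print(mfcc_means.shape)  # (13,)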
Text data
Machine learning models on text data are pervasive. Sentiment analysis, fake news detection, and many other important applications have been made possible thanks to text processing. Text data contains a lot of content that may not be needed for the prediction process and may act as a barrier to accurate predictions. For example, most text analysis doesn’t require stopwords, punctuation, or emojis, and you need to remove them in order to increase the accuracy of your model. NLTK (Natural Language Toolkit) is the most common and useful library for text processing out there. It contains a lot of functions that help in feature engineering.
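Here is a minimal NLTK sketch of stopword and punctuation removal (the sample sentence is made up, and the stopword list needs to be downloaded once):

import string
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stopword list

text = "This is a really great product, I absolutely love it!!!"

# Lowercase, strip punctuation, and split into words
words = text.lower().translate(str.maketrans("", "", string.punctuation)).split()

# Keep only words that are not stopwords
stop_words = set(stopwords.words("english"))
cleaned = [w for w in words if w not in stop_words]
print(cleaned)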
Conclusion
Feature engineering is an essential topic in machine learning. It’s not that complicated, but without feature engineering, your model can’t be in its best form. It is a tedious process, but skipping it is a bad idea. Make sure you always check whether your dataset needs any feature engineering, and only then proceed to train your model.
References
Preprocessing library: scikit-learn.