Hello, folks! In this article, we will focus on **3 important techniques to impute missing data values** in Python.

So, let us begin.

## Why do we need to impute missing data values?

Before going ahead with imputation, let us understand what a missing value is.

A missing value is an entry in the dataset that is absent or recorded as null, typically because the information was not captured during research or data collection.

Feeding missing values into a machine learning model is problematic for the following reasons:

- It **reduces the efficiency** of the ML model.
- It **affects the overall distribution** of data values.
- It leads to a **biased effect** in the estimation of the ML model.

This is where imputation comes into the picture.

By imputation, we mean replacing the missing or null values in the dataset with a substitute value.

Imputation can be done using any of the techniques below:

- **Impute by mean**
- **Impute by median**
- **KNN imputation**

Let us now understand and implement each of the techniques in the upcoming sections.

## 1. Impute missing data values by MEAN

Missing values can be imputed with the mean of the feature (data variable) they belong to. That is, every null value in a column is replaced by the mean of the non-missing values of that column.

**We will be using the same dataset, `marketing_tr.csv`, throughout the article.**

The dataset contains NULL values, which we verify in a later step. Let us now try to impute them with the mean of the feature.

### Import the required libraries

Here, at first, let us load the necessary libraries into the working environment.

    # Load libraries
    import os
    import pandas as pd
    import numpy as np

We have used the `pandas.read_csv()` function to load the dataset into the environment.

    marketing_train = pd.read_csv("C:/marketing_tr.csv")

### Verify missing values in the dataset

Before imputing missing data values, it is necessary to check for missing values using the `isnull()` function, as shown below:

    marketing_train.isnull().sum()

After executing the above line of code, we get the following count of missing values as output:

    custAge       1804
    profession       0
    marital          0
    responded        0
    dtype: int64

As clearly seen, the data variable ‘custAge’ contains 1804 missing values out of 7414 records.
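To see what `isnull().sum()` is counting, here is a minimal sketch on a hypothetical four-row DataFrame. The values are made up for illustration; only the column names are borrowed from the article's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the article's dataset
df = pd.DataFrame({
    "custAge": [34.0, np.nan, 51.0, np.nan],
    "profession": ["admin", "technician", "admin", "services"],
})

# isnull() marks each missing cell as True; sum() counts the True cells per column
missing_counts = df.isnull().sum()

# The mean of the boolean mask is the fraction of missing values per column
missing_share = df.isnull().mean()

print(missing_counts)
print(missing_share)
```

Applied to the real data, the same fraction would be 1804 / 7414, i.e. roughly 24% of the `custAge` values are missing.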

### Use the mean() method on all the null values

Further, we have used the `mean()` function to impute all the null values with the mean of the column ‘custAge’.

    missing_col = ['custAge']
    # Technique 1: Using mean to impute the missing values
    for i in missing_col:
        marketing_train.loc[marketing_train.loc[:, i].isnull(), i] = marketing_train.loc[:, i].mean()

### Verify the changes

After performing the imputation with mean, let us check whether all the values have been imputed or not.

    marketing_train.isnull().sum()

As seen below, all the missing values have been imputed and thus, we see no more missing values present.

    custAge       0
    profession    0
    marital       0
    responded     0
    dtype: int64
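The same mean imputation can also be written with `fillna()`, which some readers may find easier to follow than the `.loc` assignment. The sketch below runs on a hypothetical toy DataFrame standing in for `marketing_train`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for marketing_train
df = pd.DataFrame({"custAge": [20.0, np.nan, 40.0, np.nan, 60.0]})

missing_col = ["custAge"]
for col in missing_col:
    # mean() skips NaN by default, so this is the mean of the observed values: 40.0
    df[col] = df[col].fillna(df[col].mean())

print(df["custAge"].tolist())
```

Both approaches should give the same result; `fillna()` simply hides the boolean mask that the `.loc` version builds explicitly.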

## 2. Imputation with median

In this technique, we impute the missing values with the median of the non-missing values of the feature.

Let us understand this with the below example.

**Example:**

    # Load libraries
    import os
    import pandas as pd
    import numpy as np

    marketing_train = pd.read_csv("C:/marketing_tr.csv")

    print("count of NULL values before imputation\n")
    print(marketing_train.isnull().sum())

    missing_col = ['custAge']
    # Technique 2: Using median to impute the missing values
    for i in missing_col:
        marketing_train.loc[marketing_train.loc[:, i].isnull(), i] = marketing_train.loc[:, i].median()

    print("count of NULL values after imputation\n")
    print(marketing_train.isnull().sum())

Here, we have imputed the missing values with the median using the `median()` function.

**Output:**

    count of NULL values before imputation

    custAge       1804
    profession       0
    marital          0
    responded        0
    dtype: int64
    count of NULL values after imputation

    custAge       0
    profession    0
    marital       0
    responded     0
    dtype: int64
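One reason to consider the median instead of the mean is that it is less sensitive to extreme values. The following sketch uses a hypothetical age column with a single outlier to show how far the two fill values can drift apart; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value (120) and one missing entry
ages = pd.Series([25.0, 30.0, 35.0, 40.0, 120.0, np.nan], name="custAge")

mean_value = ages.mean()      # (25 + 30 + 35 + 40 + 120) / 5 = 50.0
median_value = ages.median()  # middle of the five observed values = 35.0

# The missing entry receives a different fill value depending on the technique
mean_imputed = ages.fillna(mean_value)
median_imputed = ages.fillna(median_value)

print(mean_value, median_value)
```

Here the outlier pulls the mean up to 50, while the median stays at 35, closer to the bulk of the observed ages.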

## 3. KNN Imputation

In this technique, the missing values get imputed based on the KNN algorithm, i.e. the **K-nearest-neighbour algorithm**.

In this algorithm, each missing value is replaced by an estimate derived from its nearest neighbours, that is, the records most similar to the one containing the missing value.
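To make the idea concrete, here is a hand-rolled sketch of the nearest-neighbour estimate for a single missing value. It is a simplification under stated assumptions: one hypothetical helper feature, k = 2, and a plain unweighted average of the neighbours. Library implementations may weight neighbours by distance or handle several features differently.

```python
import numpy as np

# Hypothetical toy data: each row is [feature, custAge]
data = np.array([
    [1.0, 30.0],
    [2.0, 40.0],
    [10.0, 70.0],
    [1.5, np.nan],  # the row whose custAge is missing
])

k = 2
target = data[3]

# Only rows with an observed custAge can act as neighbours
complete = data[~np.isnan(data[:, 1])]

# Distance from the target row to each complete row, using the observed feature
distances = np.abs(complete[:, 0] - target[0])

# Positions of the k closest complete rows
nearest = np.argsort(distances)[:k]

# Estimate the missing value as the average custAge of those neighbours
imputed_age = complete[nearest, 1].mean()
print(imputed_age)
```

The two closest rows are the ones with feature values 1.0 and 2.0, so the estimate is the average of 30 and 40, i.e. 35.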

Let us understand the implementation using the below example:

**KNN Imputation:**

    # Load libraries
    import os
    import pandas as pd
    import numpy as np

    marketing_train = pd.read_csv("C:/marketing_tr.csv")

    print("count of NULL values before imputation\n")
    print(marketing_train.isnull().sum())

Here is the count of missing values:

    count of NULL values before imputation

    custAge       1804
    profession       0
    marital          0
    responded        0
    dtype: int64

In the piece of code below, we replace the values of every object-typed (categorical) data variable with their categorical codes, store the result as object type, and record the affected column names in `lis`.

    lis = []
    for col in marketing_train.columns:
        if marketing_train[col].dtype == 'object':
            # Replace each category label with its integer code
            marketing_train[col] = pd.Categorical(marketing_train[col]).codes
            marketing_train[col] = marketing_train[col].astype('object')
            lis.append(col)
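The effect of the categorical encoding is easier to see on a small example. The sketch below uses a hypothetical `marital` column (values invented for illustration) and shows one detail worth keeping in mind: a missing category is encoded as `-1` rather than staying null.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with one missing entry
marital = pd.Series(["married", "single", "married", np.nan], name="marital")

cat = pd.Categorical(marital)

# Categories are the sorted unique labels; codes are their integer positions
categories = list(cat.categories)
codes = cat.codes.tolist()

print(categories)  # ['married', 'single']
print(codes)       # [0, 1, 0, -1]
```

In the article's dataset the categorical columns have no missing values, so this does not matter there. On other data, a `-1` code would no longer be recognised as missing by a later imputation step.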

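The `KNN()` imputer is used to impute the missing values based on the nearest neighbours. The original snippet does not show where `KNN` comes from; its `k` argument and the log output below match the `KNN` imputer of the `fancyimpute` package, so that import is assumed here.

    # Assumption: KNN is the imputer from the fancyimpute package (import not shown in the original)
    from fancyimpute import KNN

    # Apply KNN imputation algorithm
    marketing_train = pd.DataFrame(KNN(k = 3).fit_transform(marketing_train), columns = marketing_train.columns)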

**Output of imputation**:

    Imputing row 1/7414 with 0 missing, elapsed time: 13.293
    Imputing row 101/7414 with 1 missing, elapsed time: 13.311
    Imputing row 201/7414 with 0 missing, elapsed time: 13.319
    Imputing row 301/7414 with 0 missing, elapsed time: 13.319
    Imputing row 401/7414 with 0 missing, elapsed time: 13.329
    .
    .
    .
    .
    .
    Imputing row 7101/7414 with 1 missing, elapsed time: 13.610
    Imputing row 7201/7414 with 0 missing, elapsed time: 13.610
    Imputing row 7301/7414 with 0 missing, elapsed time: 13.618
    Imputing row 7401/7414 with 0 missing, elapsed time: 13.618

Finally, we check the count of missing values again:

    print("count of NULL values after imputation\n")
    print(marketing_train.isnull().sum())

**Output:**

    count of NULL values after imputation

    custAge       0
    profession    0
    marital       0
    responded     0
    dtype: int64
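As an alternative sketch, scikit-learn ships its own `KNNImputer`. This is a different implementation from the `KNN(k = 3)` call used above, so its results on the real dataset are not guaranteed to match. The toy DataFrame and its `income` column are hypothetical and exist only to give the imputer a feature to measure distances on.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; "income" is an invented helper feature
df = pd.DataFrame({
    "income": [10.0, 12.0, 50.0, 11.0],
    "custAge": [30.0, 34.0, 60.0, np.nan],
})

# n_neighbors plays the role of k; the default weighting averages the neighbours uniformly
imputer = KNNImputer(n_neighbors=2)

# fit_transform returns a NumPy array, so wrap it back into a DataFrame
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

In this toy case the two rows closest in `income` have ages 30 and 34, so the missing age is filled with their average, 32. In practice it is usually worth scaling the features first, since KNN distances are dominated by columns with large numeric ranges.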

## Conclusion

By this, we have come to the end of this topic. In this article, we have implemented 3 different techniques of imputation.

Feel free to comment below in case you come across any questions.

For more such posts related to Python, stay tuned @ Python with AskPython and keep learning!