Hello, folks! In this article, we will be focusing on **3 important techniques to Impute missing data values** in Python.

So, let us begin.

## Why do we need to impute missing data values?

Before going ahead with imputation, let us understand what a missing value is.

A missing value is an entry in the dataset that is absent or null, usually because the information was not recorded during research or data collection.

Feeding missing values into a machine learning model is considered inefficient and hazardous for the following reasons:

- **Reduces the efficiency** of the ML model.
- **Affects the overall distribution** of data values.
- It leads to a **biased effect** in the estimation of the ML model.

This is where imputation comes into the picture.

By imputation, we mean replacing the missing or null values in the dataset with a substitute value.

Imputation can be done using any of the following techniques:

- **Impute by mean**
- **Impute by median**
- **KNN imputation**

Let us now understand and implement each of the techniques in the upcoming section.

## 1. Impute missing data values by MEAN

The missing values can be imputed with the mean of that particular feature/data variable. That is, the null or missing values in a column are replaced by the mean of the non-missing values of that same column.

**We will be using the same dataset throughout the article.**

This dataset contains NULL values. Let us now try to impute them with the mean of the feature.

### Import the required libraries

First, let us load the necessary libraries into the working environment.

```
#Load libraries
import os
import pandas as pd
import numpy as np
```

We use the `pandas.read_csv()` function to load the dataset into the environment.

```
marketing_train = pd.read_csv("C:/marketing_tr.csv")
```

### Verify missing values in the dataset

Before imputing missing data values, it is necessary to check for and detect their presence using the `isnull()` function, as shown below:

```
marketing_train.isnull().sum()
```

After executing the above line of code, we get the following count of missing values as output:

```
custAge 1804
profession 0
marital 0
responded 0
dtype: int64
```

As clearly seen, the data variable ‘custAge’ contains 1804 missing values out of 7414 records.
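To make the check above concrete, here is a minimal sketch on a small made-up DataFrame (the `toy` data is hypothetical and is not the marketing dataset). It shows that `isnull().sum()` counts the missing values per column, and that `isnull().mean()` gives the share of missing rows, which can be a handy way to judge how much of a column would end up imputed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the marketing dataset
toy = pd.DataFrame({
    "custAge": [25.0, np.nan, 40.0, np.nan, 31.0],
    "marital": ["single", "married", "married", "single", "divorced"],
})

# Number of missing values in each column
missing_count = toy.isnull().sum()

# Fraction of rows that are missing in each column
missing_share = toy.isnull().mean()

print(missing_count)
print(missing_share)
```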

### Use the mean() method on all the null values

Further, we have used the `mean()` function to impute all the null values with the mean of the column ‘custAge’.

```
missing_col = ['custAge']
# Technique 1: Using mean to impute the missing values
for i in missing_col:
    # Select the rows where column i is null and assign the column mean to them
    marketing_train.loc[marketing_train.loc[:, i].isnull(), i] = marketing_train.loc[:, i].mean()
```

### Verify the changes

After performing the imputation with mean, let us check whether all the values have been imputed or not.

```
marketing_train.isnull().sum()
```

As seen below, all the missing values have been imputed and thus, we see no more missing values present.

```
custAge 0
profession 0
marital 0
responded 0
dtype: int64
```
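As a side note, the same idea can also be written with pandas' `fillna()`. The snippet below is only a sketch on a hypothetical toy column (`toy` is made up for illustration), not a replacement for the loop used in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing ages
toy = pd.DataFrame({"custAge": [20.0, np.nan, 30.0, np.nan, 40.0]})

# mean() skips NaN by default, so this is the mean of 20, 30 and 40
col_mean = toy["custAge"].mean()

# Replace every NaN in the column with that mean
toy["custAge"] = toy["custAge"].fillna(col_mean)

print(toy)
```

Since the mean is computed from the observed values only, every imputed row receives the same value.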

## 2. Imputation with median

In this technique, we impute the missing values with the median of the non-missing values of the same column.

Let us understand this with the below example.

**Example:**

```
# Load libraries
import os
import pandas as pd
import numpy as np

marketing_train = pd.read_csv("C:/marketing_tr.csv")

print("count of NULL values before imputation\n")
print(marketing_train.isnull().sum())

missing_col = ['custAge']
# Technique 2: Using median to impute the missing values
for i in missing_col:
    # Select the rows where column i is null and assign the column median to them
    marketing_train.loc[marketing_train.loc[:, i].isnull(), i] = marketing_train.loc[:, i].median()

print("count of NULL values after imputation\n")
print(marketing_train.isnull().sum())
```

Here, we have imputed the missing values with the median using the `median()` function.

**Output:**

```
count of NULL values before imputation
custAge 1804
profession 0
marital 0
responded 0
dtype: int64
count of NULL values after imputation
custAge 0
profession 0
marital 0
responded 0
dtype: int64
```
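One reason to consider the median over the mean is that it is less sensitive to extreme values. The sketch below uses a hypothetical `ages` series (made up for illustration, with one unusually large value) to compare the two fill values side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value and one outlier (95)
ages = pd.Series([22.0, 25.0, 27.0, np.nan, 30.0, 95.0])

# Both statistics ignore the NaN when they are computed
mean_value = ages.mean()      # pulled upwards by the outlier
median_value = ages.median()  # middle of the observed values

mean_filled = ages.fillna(mean_value)
median_filled = ages.fillna(median_value)

print("mean fill:  ", mean_filled.iloc[3])
print("median fill:", median_filled.iloc[3])
```

Here the mean fill is 39.8 while the median fill is 27.0, so the single outlier noticeably shifts the mean-based estimate.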

## 3. KNN Imputation

In this technique, the missing values get imputed based on the KNN algorithm i.e. **K-nearest-neighbour algorithm**.

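In this algorithm, each missing value is replaced by an estimate derived from its nearest neighbours, that is, from the records that are most similar to the record containing the missing value.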

Let us understand the implementation using the below example:

**KNN Imputation:**

```
#Load libraries
import os
import pandas as pd
import numpy as np
marketing_train = pd.read_csv("C:/marketing_tr.csv")
print("count of NULL values before imputation\n")
marketing_train.isnull().sum()
```

Here is the count of missing values:

```
count of NULL values before imputation
custAge 1804
profession 0
marital 0
responded 0
dtype: int64
```

In the piece of code below, we replace every string-valued (object type) column with its integer categorical codes, so that the KNN algorithm receives numeric data. The names of the converted columns are collected in the list `lis`.

```
lis = []
for col in marketing_train.columns:
    if marketing_train[col].dtype == 'object':
        # Replace each string with the integer code of its category
        marketing_train[col] = pd.Categorical(marketing_train[col]).codes
        # Keep the column as object type, as in the original workflow
        marketing_train[col] = marketing_train[col].astype('object')
        # Remember which columns were encoded
        lis.append(col)
```
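To see what this encoding does on a small scale, here is a minimal sketch with a hypothetical `marital` column (the values are made up for illustration). `pd.Categorical` sorts the distinct strings to form the categories and then assigns each one an integer code.

```python
import pandas as pd

# Hypothetical string column
toy = pd.DataFrame({"marital": ["single", "married", "single", "divorced"]})

cat = pd.Categorical(toy["marital"])

# Categories are the sorted unique values
categories = list(cat.categories)

# Each value is replaced by the position of its category
codes = list(cat.codes)

print(categories)
print(codes)
```

Note that the codes are only labels. Their numeric order carries no real meaning, which is worth keeping in mind when a distance-based method such as KNN is applied to them.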

The `KNN()` function is used to impute the missing values with the nearest neighbour possible.

```
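# Assumption: KNN comes from the fancyimpute package (the import is not shown in the original)
from fancyimpute import KNN

# Apply KNN imputation algorithm with 3 neighbours
marketing_train = pd.DataFrame(KNN(k = 3).fit_transform(marketing_train), columns = marketing_train.columns)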
```

**Output of imputation**:

```
Imputing row 1/7414 with 0 missing, elapsed time: 13.293
Imputing row 101/7414 with 1 missing, elapsed time: 13.311
Imputing row 201/7414 with 0 missing, elapsed time: 13.319
Imputing row 301/7414 with 0 missing, elapsed time: 13.319
Imputing row 401/7414 with 0 missing, elapsed time: 13.329
.
.
.
.
.
Imputing row 7101/7414 with 1 missing, elapsed time: 13.610
Imputing row 7201/7414 with 0 missing, elapsed time: 13.610
Imputing row 7301/7414 with 0 missing, elapsed time: 13.618
Imputing row 7401/7414 with 0 missing, elapsed time: 13.618
```

```
print("count of NULL values after imputation\n")
marketing_train.isnull().sum()
```

**Output:**

```
count of NULL values after imputation
custAge 0
profession 0
marital 0
responded 0
dtype: int64
```
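To make the nearest-neighbour idea more concrete, here is a minimal NumPy sketch of how such an imputation can work. It is not the implementation behind the `KNN()` function used above. The helper `knn_impute_column` and the `data` array are hypothetical, and the sketch assumes that only one column has missing values and that a plain (unweighted) average of the neighbours is good enough.

```python
import numpy as np

def knn_impute_column(X, target_col, k=3):
    """Fill NaNs in one column with the mean of that column
    over the k nearest rows where the column is observed."""
    X = np.asarray(X, dtype=float).copy()
    feature_cols = [c for c in range(X.shape[1]) if c != target_col]

    missing = np.isnan(X[:, target_col])
    donors = X[~missing]  # rows where the target value is known

    for row in np.where(missing)[0]:
        # Euclidean distance on the fully observed feature columns
        diffs = donors[:, feature_cols] - X[row, feature_cols]
        dists = np.linalg.norm(diffs, axis=1)

        # Indices of the k closest donor rows
        nearest = np.argsort(dists)[:k]

        # Impute with the average target value of those neighbours
        X[row, target_col] = donors[nearest, target_col].mean()
    return X

# Hypothetical data: two features and an age column with one missing value
data = [
    [1.0, 1.0, 20.0],
    [1.1, 0.9, 22.0],
    [0.9, 1.2, 24.0],
    [5.0, 5.0, 60.0],
    [1.0, 1.1, np.nan],
]

filled = knn_impute_column(data, target_col=2, k=3)
print(filled[4, 2])
```

The last row sits close to the first three rows and far from the fourth, so its missing age is filled with the average of 20, 22 and 24 rather than being influenced by the distant record.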

## Conclusion

By this, we have come to the end of this topic. In this article, we have implemented 3 different techniques of imputation.

Feel free to comment below in case you have any questions.

For more such posts related to Python, stay tuned @ Python with AskPython and keep learning!