Hello, folks! In this article, we will be focusing on 3 important techniques to Impute missing data values in Python.
So, let us begin.
Why do we need to impute missing data values?
Before going ahead with imputation, let us understand what is a missing value.
So, a missing value is the part of the dataset that seems missing or is a null value, maybe due to some missing data during research or data collection.
Having a missing value in a machine learning model is considered very inefficient and hazardous because of the following reasons:
- Reduces the efficiency of the ML model.
- Affects the overall distribution of data values.
- It leads to a biased effect in the estimation of the ML model.
This is when imputation comes into picture.
By imputation, we mean to replace the missing or null values with a particular value in the entire dataset.
Imputation can be done using any of the below techniques–
- Impute by mean
- Impute by median
- Knn Imputation
Let us now understand and implement each of the techniques in the upcoming section.
1. Impute missing data values by MEAN
The missing values can be imputed with the mean of that particular feature/data variable. That is, the null or missing values can be replaced by the mean of the data values of that particular data column or dataset.
Let us have a look at the below dataset which we will be using throughout the article.

As clearly seen, the above dataset contains NULL values. Let us now try to impute them with the mean of the feature.
Import the required libraries
Here, at first, let us load the necessary datasets into the working environment.
#Load libraries
import os
import pandas as pd
import numpy as np
We have used pandas.read_csv() function to load the dataset into the environment.
marketing_train = pd.read_csv("C:/marketing_tr.csv")
Verify missing values in the database
Before we imputing missing data values, it is necessary to check and detect the presence of missing values using isnull() function
as shown below–
marketing_train.isnull().sum()
After executing the above line of code, we get the following count of missing values as output:
custAge 1804
profession 0
marital 0
responded 0
dtype: int64
As clearly seen, the data variable ‘custAge’ contains 1804 missing values out of 7414 records.
Use the mean() method on all the null values
Further, we have used mean() function
to impute all the null values with the mean of the column ‘custAge’.
missing_col = ['custAge']
#Technique 1: Using mean to impute the missing values
for i in missing_col:
marketing_train.loc[marketing_train.loc[:,i].isnull(),i]=marketing_train.loc[:,i].mean()
Verify the changes
After performing the imputation with mean, let us check whether all the values have been imputed or not.
marketing_train.isnull().sum()
As seen below, all the missing values have been imputed and thus, we see no more missing values present.
custAge 0
profession 0
marital 0
responded 0
dtype: int64
2. Imputation with median
In this technique, we impute the missing values with the median of the data values or the data set.
Let us understand this with the below example.
Example:
#Load libraries
import os
import pandas as pd
import numpy as np
marketing_train = pd.read_csv("C:/marketing_tr.csv")
print("count of NULL values before imputation\n")
marketing_train.isnull().sum()
missing_col = ['custAge']
#Technique 2: Using median to impute the missing values
for i in missing_col:
marketing_train.loc[marketing_train.loc[:,i].isnull(),i]=marketing_train.loc[:,i].median()
print("count of NULL values after imputation\n")
marketing_train.isnull().sum()
Here, we have imputed the missing values with median using median() function
.
Output:
count of NULL values before imputation
custAge 1804
profession 0
marital 0
responded 0
dtype: int64
count of NULL values after imputation
custAge 0
profession 0
marital 0
responded 0
dtype: int64
3. KNN Imputation
In this technique, the missing values get imputed based on the KNN algorithm i.e. K-nearest-neighbour algorithm.
In this algorithm, the missing values get replaced by the nearest neighbor estimated values.
Let us understand the implementation using the below example:
KNN Imputation:
#Load libraries
import os
import pandas as pd
import numpy as np
marketing_train = pd.read_csv("C:/marketing_tr.csv")
print("count of NULL values before imputation\n")
marketing_train.isnull().sum()
Here, is the count of missing values:
count of NULL values before imputation
custAge 1804
profession 0
marital 0
responded 0
dtype: int64
In the below piece of code, we have converted the data types of the data variables to object type with categorical codes assigned to them.
lis = []
for i in range(0, marketing_train.shape[1]):
if(marketing_train.iloc[:,i].dtypes == 'object'):
marketing_train.iloc[:,i] = pd.Categorical(marketing_train.iloc[:,i])
#print(marketing_train[[i]])
marketing_train.iloc[:,i] = marketing_train.iloc[:,i].cat.codes
marketing_train.iloc[:,i] = marketing_train.iloc[:,i].astype('object')
lis.append(marketing_train.columns[i])
The KNN() function
is used to impute the missing values with the nearest neighbour possible.
#Apply KNN imputation algorithm
marketing_train = pd.DataFrame(KNN(k = 3).fit_transform(marketing_train), columns = marketing_train.columns)
Output of imputation:
Imputing row 1/7414 with 0 missing, elapsed time: 13.293
Imputing row 101/7414 with 1 missing, elapsed time: 13.311
Imputing row 201/7414 with 0 missing, elapsed time: 13.319
Imputing row 301/7414 with 0 missing, elapsed time: 13.319
Imputing row 401/7414 with 0 missing, elapsed time: 13.329
.
.
.
.
.
Imputing row 7101/7414 with 1 missing, elapsed time: 13.610
Imputing row 7201/7414 with 0 missing, elapsed time: 13.610
Imputing row 7301/7414 with 0 missing, elapsed time: 13.618
Imputing row 7401/7414 with 0 missing, elapsed time: 13.618
print("count of NULL values after imputation\n")
marketing_train.isnull().sum()
Output:
count of NULL values before imputation
custAge 0
profession 0
marital 0
responded 0
dtype: int64
Conclusion
By this, we have come to the end of this topic. In this article, we have implemented 3 different techniques of imputation.
Feel free to comment below, in case you come across any question.
For more such posts related to Python, Stay tuned @ Python with AskPython and Keep Learning!