One hot encoding in Python - A Practical Approach

Hello, readers! In this article, we will focus on the practical implementation of one-hot encoding in Python.

So, let us get started!

First, what is one hot encoding?

Before diving deep into the concept of one-hot encoding, let us understand some prerequisites.

Variables fall into two main types:

Continuous variables: These are variables that take numeric values. Example: [1, 2, 3, 4, 5, 6, ..., 100]
Categorical variables: These variables represent categories or groups in the data values. Example: [apple, mango, berry]
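
As a minimal sketch of this distinction, the snippet below splits the columns of a small pandas DataFrame into numeric and non-numeric ones. The DataFrame and its column names (`weight`, `fruit`) are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data: one numeric column and one categorical column
df = pd.DataFrame({
    "weight": [120, 85, 15, 130],
    "fruit": ["apple", "mango", "berry", "apple"],
})

# select_dtypes() filters columns by their data type
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)
```

Columns that end up in the non-numeric group are the usual candidates for an encoding step.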

In a dataset, we often come across categorical data in the form of groups such as [apple, berry, mango]. Since most machine learning algorithms expect numeric input, we use encoding techniques to represent each category as a separate numeric entity.

The most commonly used encoding techniques include:

Dummy variables
Label Encoding
One hot encoding, etc.
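
To get a rough feel for how dummy variables relate to one-hot encoding, here is a small sketch using pandas.get_dummies(), which is one common way to build such columns. The sample values are illustrative only. With drop_first=True, one category is dropped and is implied when all the remaining columns are 0.

```python
import pandas as pd

fruits = pd.Series(["apple", "mango", "berry", "apple"])

# One column per category (one-hot style): apple, berry, mango
one_hot = pd.get_dummies(fruits, dtype=int)

# Dummy coding: the first category (apple) is dropped, leaving k-1 columns
dummies = pd.get_dummies(fruits, drop_first=True, dtype=int)

print(one_hot)
print(dummies)
```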

Today, let us discuss one-hot encoding.

One hot encoding represents the categorical data in the form of binary vectors.

Now, you may ask: if one-hot encoding represents the categories as binary vectors, at what point does the categorical data get converted into numbers?

Well, conceptually, one-hot encoding first maps each categorical value to an integer index. This mapping is exactly what Label Encoding does. (Since version 0.20, scikit-learn's OneHotEncoder performs this mapping internally and accepts string categories directly, so an explicit Label Encoding step is optional there.)

Don’t worry, we will cover the practical use of Label Encoding in the sections below.

So, with one-hot encoding, every category is assigned an integer value, and each data value is then represented as a binary vector in which all entries are zero except the one at the index of its category's integer, which is set to 1.

One Hot Encoding Implementation Examples

Consider a dataset with the categories [apple, berry]. After applying Label Encoding, let’s say apple is assigned 0 and berry is assigned 1.

One-hot encoding then creates a binary vector of length 2 for each value. The label ‘apple’, which is encoded as 0, gets the binary vector [1, 0].

This is because the value 1 is placed at the encoded index, which is 0 for apple (as seen in its label encoding). Likewise, ‘berry’, which is encoded as 1, gets the binary vector [0, 1].

So, [apple, berry, berry] would be encoded as:

[1, 0]
[0, 1]
[0, 1]
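
The mapping above can also be reproduced by hand. The following is a minimal sketch, not a replacement for a library encoder: it assumes the categories are indexed in alphabetical order (which matches the apple=0, berry=1 assignment used here) and uses rows of a NumPy identity matrix as the binary vectors.

```python
import numpy as np

data = ["apple", "berry", "berry"]

# Step 1: map each category to an integer index (alphabetical order assumed)
categories = sorted(set(data))                       # ['apple', 'berry']
index_of = {cat: i for i, cat in enumerate(categories)}
labels = [index_of[value] for value in data]         # [0, 1, 1]

# Step 2: row i of the identity matrix is the one-hot vector for index i
one_hot = np.eye(len(categories), dtype=int)[labels]

print(one_hot)
```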

Let us now implement the concept through examples.

Example 1: One hot encoding with the grouped categorical data

Have a look at the example below! Here, we encode the fruit categories with one-hot encoding.

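from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

cat_data = ["apple", "mango", "apple", "berry", "mango", "apple", "berry", "apple"]

# Map each category to an integer label (alphabetical order: apple=0, berry=1, mango=2)
label = LabelEncoder()
int_data = label.fit_transform(cat_data)
# OneHotEncoder expects a 2D array of shape (n_samples, n_features)
int_data = int_data.reshape(-1, 1)

# The default output is a sparse matrix; toarray() converts it to a dense NumPy array.
# (The older sparse=False argument was renamed to sparse_output in scikit-learn 1.2 and later removed.)
encoder = OneHotEncoder()
onehot_data = encoder.fit_transform(int_data).toarray()
print("Categorical data encoded into one-hot vectors....\n")
print(onehot_data)

Output:

Categorical data encoded into one-hot vectors....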

[[1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]]

Explanation:

After loading the data, we create a LabelEncoder() object to encode the categorical data into integer values.
Next, we pass this integer data to OneHotEncoder() to encode the integer values into binary vectors, with one position per category.
The fit_transform() function first learns the set of categories from the data (fit) and then applies the encoding to that same data (transform) in a single call.
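
As a side note, scikit-learn 0.20 and later let OneHotEncoder work on string categories directly. The sketch below shows that shorter route under that assumption, along with the categories_ attribute (the learned categories) and inverse_transform() (mapping vectors back to labels). The sample list is illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

cat_data = ["apple", "mango", "apple", "berry"]

# OneHotEncoder expects a 2D input: one row per sample, one column per feature
X = np.array(cat_data).reshape(-1, 1)

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense
onehot = encoder.fit_transform(X).toarray()

print(encoder.categories_)   # categories learned during fit, in column order
print(onehot)

# inverse_transform() maps the binary vectors back to the original labels
restored = encoder.inverse_transform(onehot)
print(restored.ravel())
```

Keeping the fitted encoder object around is useful because the same column order can then be applied to new data with transform().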

Example 2: One hot encoding on a dataset

In this example, we have pulled a dataset into the Python environment. You can find the dataset below for your reference.

Next, we use ColumnTransformer() to apply OneHotEncoder() to the first column of the dataset (column index 0). The remainder='passthrough' argument keeps all other columns unchanged.

Finally, we call fit_transform() on the whole dataset and convert the result into a NumPy array of strings.

Let’s import the pandas and numpy libraries.

import pandas
import numpy
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

cat_data = pandas.read_csv("bank-loan.csv")
#print(cat_data)

# One-hot encode column 0 and pass the remaining columns through unchanged
column_set = ColumnTransformer([('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# numpy.str was removed in NumPy 1.24; the built-in str is the equivalent dtype
onehot_data = numpy.array(column_set.fit_transform(cat_data), dtype=str)

print(onehot_data)

Output:

As you can see, the output contains two columns: the first column corresponds to category 0 and the second column to category 1.

[['0.0' '1.0']
 ['1.0' '0.0']
 ['1.0' '0.0']
 ...
 ['1.0' '0.0']
 ['1.0' '0.0']
 ['1.0' '0.0']]
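
Since the bank-loan.csv file is not reproduced here, the following self-contained sketch applies the same ColumnTransformer pattern to a small in-memory DataFrame. The column names (`default`, `income`) and values are hypothetical stand-ins and are not taken from the actual dataset. Setting sparse_threshold=0 is a choice made here so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real dataset
df = pd.DataFrame({
    "default": ["no", "yes", "no", "no"],
    "income": [30, 45, 28, 60],
})

# Encode only the 'default' column; pass 'income' through unchanged
transformer = ColumnTransformer(
    [("encoder", OneHotEncoder(), ["default"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

encoded = transformer.fit_transform(df)

# Encoded columns come first (no, yes), followed by the passthrough column
print(encoded)
```

Note that the one-hot columns are placed before the passthrough columns, so the column order of the result differs from the input when the encoded column is not the first one.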

Conclusion

With this, we have come to the end of this topic. Feel free to comment below if you have any questions. Till then, stay tuned and happy learning!! 🙂