Creating dummy variables in Python

Hello, readers! In this article, we will learn how to create dummy variables in Python.

So, let us get started!

First, what is a dummy variable?

Let me introduce dummy variables, an important concept in data modeling, through the scenario below.

Consider a dataset that combines continuous and categorical data. As soon as we read the word ‘categorical’, what first comes to mind is categories, or groups, in the data.

A categorical variable usually represents several different categories. Handling a large number of groups and feeding them to a model becomes tedious and complex as the dataset grows, and the data soon becomes ambiguous for the model to interpret.

This is where the concept of dummy variables comes into the picture.

A dummy variable is a numeric variable which represents the sub-categories or sub-groups of the categorical variables of the dataset.

In a nutshell, a dummy variable lets us differentiate between the sub-groups of the data, which in turn lets us use the data for regression analysis as well.

Have a look at the below example!

Consider a dataset that contains 10-15 variables, one of which is a gender variable with the categories ‘Male’ and ‘Female’.

The task is to understand which gender usually chooses ‘pink’ as the color of their mobile cases. In this case, we can use a dummy variable and assign 0 to Male and 1 to Female. This in turn gives the model a clearer, numeric representation of the data fed to it.
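The 0/1 assignment described above can be sketched as follows. This is a minimal illustration on made-up data: the DataFrame and the column names gender, case_color, gender_dummy and is_pink are assumptions, not part of any real dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "case_color": ["black", "pink", "pink", "blue"],
})

# Assign 0 to Male and 1 to Female.
df["gender_dummy"] = df["gender"].map({"Male": 0, "Female": 1})

# Flag the rows where the case color is pink (1 = pink, 0 = anything else).
df["is_pink"] = (df["case_color"] == "pink").astype(int)

print(df)
```

With both columns numeric, a model can relate gender_dummy to is_pink directly.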

Let us create a dummy variable in Python now!

We will use the Bike rental count prediction problem to create and analyse dummy variables.

So, let us begin!

1. Load the dataset

First, we need to load the dataset into the working environment as shown below:

import pandas
BIKE = pandas.read_csv("Bike.csv")

2. Create a copy of the original dataset to work on.

To make sure that the original dataset remains unaltered, we create a copy of it and create the dummies on that copy.

We use the pandas.DataFrame.copy() function for this.

bike = BIKE.copy()
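To see why the copy matters, here is a small sketch, assuming a tiny made-up frame in place of Bike.csv. A change to the copy does not reach the original, because copy() makes a deep copy by default.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset; the values are made up.
BIKE = pd.DataFrame({"season": [1, 2], "cnt": [16, 40]})

bike = BIKE.copy()       # deep copy by default
bike.loc[0, "cnt"] = 0   # modify the working copy only

print(BIKE.loc[0, "cnt"])   # value in the original
print(bike.loc[0, "cnt"])   # value in the copy
```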

3. Store all the categorical variables in a list

Let us now save all the categorical variables from the dataset into a list to work on!

categorical_col_updated = ['season','yr','mnth','weathersit','holiday']
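Before creating dummies, it can help to check how many distinct values each categorical column holds, since each distinct value becomes one dummy column. The sketch below assumes a made-up miniature of the bike data with only two of the categorical columns.

```python
import pandas as pd

# Hypothetical miniature of the bike data; the values are made up.
bike = pd.DataFrame({
    "season": [1, 2, 3, 4, 1],
    "yr": [0, 0, 1, 1, 1],
    "temp": [0.34, 0.36, 0.20, 0.23, 0.27],
})
categorical_col_updated = ["season", "yr"]

# Count the distinct levels in each categorical column.
levels = bike[categorical_col_updated].nunique()
print(levels)
```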

4. Use the get_dummies() function to create dummies of the variables

The Pandas module provides the pandas.get_dummies() function to create dummies of categorical data.

bike = pandas.get_dummies(bike, columns = categorical_col_updated)
print(bike.columns)

We have passed the dataset and the list of categorical column names to the function to create dummies.

Output:

As seen below, a separate dummy column is created for every sub-group under each category.

For example, the column ‘mnth’ has all 12 months as categories.

Thus, every single month is treated as a sub-group, and the get_dummies() function has created a separate column for every month.

Index(['temp', 'hum', 'windspeed', 'cnt', 'season_1', 'season_2', 'season_3',
       'season_4', 'yr_0', 'yr_1', 'mnth_1', 'mnth_2', 'mnth_3', 'mnth_4',
       'mnth_5', 'mnth_6', 'mnth_7', 'mnth_8', 'mnth_9', 'mnth_10', 'mnth_11',
       'mnth_12', 'weathersit_1', 'weathersit_2', 'weathersit_3', 'holiday_0',
       'holiday_1'],
      dtype='object')
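The same naming pattern can be reproduced on a small frame. The sketch below uses made-up values and only two categorical columns. It also shows the optional drop_first parameter of pandas.get_dummies(), which drops the first level of each variable; that step is not part of the walkthrough above, but it is often used when the dummies feed a regression model.

```python
import pandas as pd

# Hypothetical miniature of the bike data; the values are made up.
toy = pd.DataFrame({
    "cnt": [985, 801, 1349],
    "season": [1, 2, 1],
    "holiday": [0, 0, 1],
})

# One column per level, named <column>_<level>.
dummies = pd.get_dummies(toy, columns=["season", "holiday"])
print(list(dummies.columns))

# Optional: drop the first level of each variable to avoid redundant columns.
reduced = pd.get_dummies(toy, columns=["season", "holiday"], drop_first=True)
print(list(reduced.columns))
```

The untouched column cnt stays in place, and the new dummy columns are appended after it.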

You can find the resultant dataset by the get_dummies() function here.