Analyzing Cars.csv File in Python - A Complete Guide

In this tutorial let us understand how to explore the cars.csv dataset using Python. We will first load the dataset and then process the data. We will also be visualizing the dataset and when the final dataset is prepared, the same dataset can be used to develop various models.

Steps to Analyze Cars.csv Dataset in Python

We’ll be using Pandas and Numpy for this analysis. We’ll also be playing around with visualizations using the Seaborn library. Let’s get right into this.

1. Loading the Cars.csv Dataset

Since the dataset is already in a CSV format, all we need to do is format the data into a pandas data frame. This was done by using a pandas data frame method called read_csv by importing pandas library.

The read_csv data frame method is used by passing the path of the CSV file as an argument to the function. The code results in a neatly organized pandas data frame when we make use of the head function.

Let’s start by importing all the necessary modules and libraries into our code.

import pandas as pd 
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
sns.set(color_codes=True)

Now the data is loaded with the help of the pandas module.

df = pd.read_csv("CARS.csv")
df.head()

This data set has 428 rows and 15 features having data about different car brands such as BMW, Mercedes, Audi, and more and has multiple features about these cars such as Model, Type, Origin, Drive Train, MSRP, and more such features.

2. Removing irrelevant features

In the later sections if we are required to compute the price of the car based on some features given to us. But not all features are necessary in order to determine the price of the car, we aim to remove the same irrelevant features from our dataset.

The features that we are going to remove are Drive Train, Model, Invoice, Type, and Origin. All those features are not necessary to determine the costs. You can remove or keep features according to your preferences.

df = df.drop(['Model','DriveTrain','Invoice', 'Origin', 'Type'], axis=1)
df.head()

3. Finding duplicate data

In any dataset, there might be duplicate/redundant data and in order to remove the same we make use of a reference feature (in this case MSRP). The reason why I make MSRP as a reference is the prices of two vehicles can rarely match 100%.

In order to remove the duplicates, we make use of the code mentioned below. You can observe that the number of rows is reduced from 428 to 410 rows.

print("Count before Removing Duplicates: ")
df.count()

The output looks something like what’s shown below.

Count before Removing Duplicates: 
Make           428
MSRP           428
EngineSize     428
Cylinders      426
Horsepower     428
MPG_City       428
MPG_Highway    428
Weight         428
Wheelbase      428
Length         428
dtype: int64

df = df.drop_duplicates(subset='MSRP', keep='first')

print("Count after Removing Duplicates: ")
df.count()

Count after Removing Duplicates: 
Make           410
MSRP           410
EngineSize     410
Cylinders      408
Horsepower     410
MPG_City       410
MPG_Highway    410
Weight         410
Wheelbase      410
Length         410
dtype: int64

4. Finding the missing or null values

No dataset is perfect and having missing values in the dataset is a pretty common thing to happen. Now, there are several approaches to deal with the missing value.

One can either drop either row or fill the empty values with the mean of all values in that column. It is better to take the mean of the column values rather than deleting the entire row as every row is important for a developer.

Let us first look at how many null values we have in our dataset.

print(df.isnull().sum())

You can observe that there are two null values in the Cylinders column and the rest are clear. Our aim will be to handle the 2 null values of the column. It was found that the null values belong to row 247 and 248, so we will replace the same with the mean of all the values.

# Calculate mean of all the values of the column
val = df['Cylinders'].mean()
print("The mean of the column is: ",val)
val = round(val)
print("Rounded value of  the mean of the column is: ",val)

# Replace the null value with the mean of the column
df['Cylinders'][247] = val
df['Cylinders'][248] = val

5. Converting the object values to an integer type

If you haven’t observed yet, the values of MSRP start with $ but we need the values to be of type integer. Hence, we need to make sure that the dollar sign is removed from all the values in that column.

df['MSRP'] = [x.replace('$','') for x in df['MSRP']] 

df['MSRP'] = [x.replace(',','') for x in df['MSRP']]

df['MSRP']=pd.to_numeric(df['MSRP'],errors='coerce')

df.head()

6. Visualizing HeatMaps

Heatmaps are the maps that are one of the best ways to find the correlation between the features. When the heatmaps is plotted we can see a strong dependency between the MSRP and Horsepower.

plt.figure(figsize=(10,10))
plt.style.use("seaborn")
c= df.corr()
sns.heatmap(c,cmap='PiYG',annot=True)

Hope you understood the concept and would apply the same in various other CSV files. Thank you for reading!