In this tutorial let us understand how to explore the cars.csv dataset using Python. We will first load the dataset and then process the data. We will also be visualizing the dataset and when the final dataset is prepared, the same dataset can be used to develop various models.
Steps to Analyze Cars.csv Dataset in Python
We’ll be using Pandas and Numpy for this analysis. We’ll also be playing around with visualizations using the Seaborn library. Let’s get right into this.
1. Loading the Cars.csv Dataset
Since the dataset is already in a CSV format, all we need to do is format the data into a pandas data frame. This was done by using a pandas data frame method called
read_csv by importing
read_csv data frame method is used by passing the path of the CSV file as an argument to the function. The code results in a neatly organized pandas data frame when we make use of the
Let’s start by importing all the necessary modules and libraries into our code.
import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt sns.set(color_codes=True)
Now the data is loaded with the help of the
df = pd.read_csv("CARS.csv") df.head()
This data set has
428 rows and
15 features having data about different car brands such as BMW, Mercedes, Audi, and more and has multiple features about these cars such as Model, Type, Origin, Drive Train, MSRP, and more such features.
2. Removing irrelevant features
In the later sections if we are required to compute the price of the car based on some features given to us. But not all features are necessary in order to determine the price of the car, we aim to remove the same irrelevant features from our dataset.
The features that we are going to remove are Drive Train, Model, Invoice, Type, and Origin. All those features are not necessary to determine the costs. You can remove or keep features according to your preferences.
df = df.drop(['Model','DriveTrain','Invoice', 'Origin', 'Type'], axis=1) df.head()
3. Finding duplicate data
In any dataset, there might be duplicate/redundant data and in order to remove the same we make use of a
reference feature (in this case MSRP). The reason why I make MSRP as a reference is the prices of two vehicles can rarely match 100%.
In order to remove the duplicates, we make use of the code mentioned below. You can observe that the number of rows is reduced from 428 to 410 rows.
print("Count before Removing Duplicates: ") df.count()
The output looks something like what’s shown below.
Count before Removing Duplicates: Make 428 MSRP 428 EngineSize 428 Cylinders 426 Horsepower 428 MPG_City 428 MPG_Highway 428 Weight 428 Wheelbase 428 Length 428 dtype: int64
df = df.drop_duplicates(subset='MSRP', keep='first') print("Count after Removing Duplicates: ") df.count()
Count after Removing Duplicates: Make 410 MSRP 410 EngineSize 410 Cylinders 408 Horsepower 410 MPG_City 410 MPG_Highway 410 Weight 410 Wheelbase 410 Length 410 dtype: int64
4. Finding the missing or null values
No dataset is perfect and having missing values in the dataset is a pretty common thing to happen. Now, there are several approaches to deal with the missing value.
One can either drop either row or fill the empty values with the mean of all values in that column. It is better to take the mean of the column values rather than deleting the entire row as every row is important for a developer.
Let us first look at how many null values we have in our dataset.
You can observe that there are two null values in the
Cylinders column and the rest are clear. Our aim will be to handle the 2 null values of the column. It was found that the null values belong to row
247 and 248, so we will replace the same with the mean of all the values.
# Calculate mean of all the values of the column val = df['Cylinders'].mean() print("The mean of the column is: ",val) val = round(val) print("Rounded value of the mean of the column is: ",val) # Replace the null value with the mean of the column df['Cylinders'] = val df['Cylinders'] = val
5. Converting the object values to an integer type
If you haven’t observed yet, the values of MSRP start with
$ but we need the values to be of type integer. Hence, we need to make sure that the dollar sign is removed from all the values in that column.
df['MSRP'] = [x.replace('$','') for x in df['MSRP']] df['MSRP'] = [x.replace(',','') for x in df['MSRP']] df['MSRP']=pd.to_numeric(df['MSRP'],errors='coerce') df.head()
6. Visualizing HeatMaps
Heatmaps are the maps that are one of the best ways to find the correlation between the features. When the heatmaps is plotted we can see a strong dependency between the MSRP and Horsepower.
plt.figure(figsize=(10,10)) plt.style.use("seaborn") c= df.corr() sns.heatmap(c,cmap='PiYG',annot=True)
Hope you understood the concept and would apply the same in various other CSV files. Thank you for reading!