Data Analysis in Python with Pandas

Data Analysis

Data analysis is one of the most important skills in today’s world. Data is present in every domain of life, whether it is biological data or data from a tech company. No matter what kind of data you are working with, you need to know how to filter and analyze it. Today we are going to work with one such data analysis tool in Python: Pandas.

Let’s get started by first learning about some of the major libraries used for data analysis in Python.

Major Libraries for Data Analysis in Python

Python has many robust libraries that give data analysts the functionality they need to analyze data.

  • NumPy and SciPy: Both of these libraries are powerful and extensively used in scientific computing.
  • Pandas: Pandas is a robust tool used for data manipulation. It is a relatively recent addition to the data science toolkit.
  • Matplotlib: Matplotlib is an excellent package and is mainly used for plotting and visualization. You can plot a variety of graphs using Matplotlib, such as histograms, line plots, heatmaps, etc.
  • Scikit-Learn: Scikit-Learn is an excellent tool for machine learning. This library has all the necessary tools required for machine learning and statistical modeling.
  • Statsmodels: Statsmodels is another excellent tool for statistical modeling. This library allows users to build statistical models and analyze them.
  • Seaborn: Seaborn is also extensively used for data visualization. It is based on Matplotlib and is used for building statistical graphics in Python.

Out of all these tools, we are going to learn about Pandas and work with hands-on data analysis in Pandas in this article.

What is Pandas and Why is it so useful in Data Analysis?

Pandas is an open-source Python library built on top of the NumPy package. It provides functions and methods that make the data analysis process faster and easier. Because of its flexibility and simple syntax, it is one of the most commonly used tools for data analysis. Pandas is especially helpful when working with Excel spreadsheets, tabular data, or SQL tables.

The two main data structures in Pandas are DataFrame and Series. A DataFrame is a two-dimensional data structure, while a Series is a one-dimensional labeled array, such as a single column of a DataFrame. In this article, we will be working with the Pandas DataFrame. Data can be imported in a variety of formats for data analysis in Python, such as CSV, JSON, and SQL.
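
As a rough illustration of the two structures, here is a minimal sketch. The values are taken from the first three rows of the dataset used later in this article; the variable names are only illustrative.

```python
import pandas as pd

countries = ["Netherlands", "Montenegro", "Estonia"]

# A Series is a one-dimensional labeled array.
male_height = pd.Series([183.78, 183.30, 182.79],
                        index=countries,
                        name="Male Height in Cm")

# A DataFrame is a two-dimensional table; each of its columns is a Series.
heights = pd.DataFrame({
    "Male Height in Cm": [183.78, 183.30, 182.79],
    "Female Height in Cm": [170.36, 169.96, 168.66],
}, index=countries)

print(male_height.loc["Estonia"])          # label-based lookup in a Series
print(type(heights["Male Height in Cm"]))  # selecting one column gives a Series
print(heights.shape)                       # (rows, columns)
```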

Now let’s get on to the data analysis part.

Installing Different Environments and Importing Pandas

First, you need to install Pandas. You can use different environments for this. You can either use Anaconda to run Pandas directly on your computer, or you can use a Jupyter Notebook in your browser through Google Colab. Anaconda comes with many pre-installed packages and can easily be installed on Mac, Windows, or Linux.

Let’s go through the steps to install and import Pandas. To install Pandas in your environment, use the pip command.

pip install pandas

Note: If you are using Google Colab, you do not need to run this command since Google Colab comes with Pandas pre-installed.

Now, to import Pandas into your environment, type the following command.

import pandas as pd
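
To confirm that the installation worked, one simple check (a sketch, not a required step) is to import the library and print its version. The version matters later on, because a few older methods are no longer available in recent releases.

```python
import pandas as pd

# If this line runs without an ImportError, Pandas is installed.
print(pd.__version__)
```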

Now that we know how to install and import Pandas, let’s look more closely at what a Pandas DataFrame is.

The Pandas DataFrame

A Pandas DataFrame is a two-dimensional data structure, almost like a 2-D array. A DataFrame has labeled axes (rows and columns) and is mutable.

Let’s get on to the hands-on data analysis part.

In this article, we are using a Kaggle dataset on the “height of male and female by country in 2022.”

Link to the dataset: https://www.kaggle.com/majyhain/height-of-male-and-female-by-country-2022

Let’s load the dataset now and read it.

Reading CSV Files and Loading the Data

To read the file into a DataFrame, pass the path of your file as an argument to the following function.

df = pd.read_csv("C:/Users/Intel/Documents/Height of Male and Female by Country 2022.csv")
df.head()

Here we have used the read_csv function as we are reading a CSV file.


You can check the first n entries of your dataframe with the help of the head function. If you don’t pass the number of entries, the first 5 rows will be displayed by default.
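
If you do not have the file at hand, the same calls can be tried on a small in-memory CSV. The sketch below uses io.StringIO as a stand-in for the file path; the rows are a non-contiguous sample copied from the outputs shown in this article, and the names csv_text and sample are illustrative.

```python
import io
import pandas as pd

# Stand-in for the CSV file on disk: a few rows from the dataset.
csv_text = """Rank,Country Name,Male Height in Cm,Female Height in Cm
1,Netherlands,183.78,170.36
2,Montenegro,183.30,169.96
3,Estonia,182.79,168.66
4,Bosnia and Herzegovina,182.47,167.47
5,Iceland,182.10,168.91
11,Ukraine,180.98,166.62
193,Nepal,164.36,152.39
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head())    # first 5 rows by default
print(sample.head(2))   # first 2 rows
print(sample.tail(1))   # last row
```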

Evaluating the Pandas DataFrame

Now we will have a look at the dataframe that we are working with.

Let’s have a look at the dimensions of the data that we are using. For that, we run the following command.

df.shape
(199, 6)

shape is an attribute rather than a method, so it is written without parentheses. It returns a tuple with the number of rows and columns. We can see that our dataframe has 199 rows and 6 columns, or features.
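
Because the result is an ordinary tuple, it can be unpacked directly. A small sketch on an illustrative three-row sample (values from the first rows of the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Country Name": ["Netherlands", "Montenegro", "Estonia"],
    "Male Height in Cm": [183.78, 183.30, 182.79],
})

n_rows, n_cols = sample.shape  # unpack the (rows, columns) tuple
print(n_rows, n_cols)
print(len(sample))             # number of rows
print(sample.size)             # total number of cells (rows * columns)
```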

Next, we will see a summary of our dataset with the help of the info method.

df.info()

Note that info is a method and must be called with parentheses. Writing df.info without them only displays the bound method object, which prints the whole dataframe rather than the summary. The output of df.info() gives us some valuable information about the dataframe: the column names, the number of non-null values, the dtypes, and the memory usage.
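
Here is a minimal sketch of how info behaves, using an illustrative two-column sample. It prints its summary and returns None; the buf parameter can redirect the text into a buffer if you want to keep it as a string.

```python
import io
import pandas as pd

sample = pd.DataFrame({
    "Country Name": ["Netherlands", "Montenegro", "Estonia"],
    "Male Height in Cm": [183.78, 183.30, 182.79],
})

# info() prints the summary to the screen and returns None.
result = sample.info()

# Redirect the summary into a text buffer to capture it as a string.
buffer = io.StringIO()
sample.info(buf=buffer)
summary = buffer.getvalue()

# dtypes alone gives just the column types.
print(sample.dtypes)
```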

Next, we will get a little bit of an idea of the statistics of the dataset.

df.describe()

In the output, we can see the count, mean, standard deviation, minimum and maximum values, and the lower quartile, median (the 50% row), and upper quartile for each numerical feature in the dataset.
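
The result of describe is itself a DataFrame, so individual statistics can be looked up by row label. A small sketch on an illustrative five-row sample (the first five rows of the dataset):

```python
import pandas as pd

sample = pd.DataFrame({
    "Country Name": ["Netherlands", "Montenegro", "Estonia",
                     "Bosnia and Herzegovina", "Iceland"],
    "Male Height in Cm": [183.78, 183.30, 182.79, 182.47, 182.10],
    "Female Height in Cm": [170.36, 169.96, 168.66, 167.47, 168.91],
})

stats = sample.describe()  # only numerical columns are summarized by default
print(stats)

# The "50%" row is the median of each column.
median_from_describe = stats.loc["50%", "Male Height in Cm"]
median_direct = sample["Male Height in Cm"].median()
print(median_from_describe, median_direct)
```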

Data Manipulation and Analysis

Let’s first quickly look at the different features in the dataset to get a better understanding of it.

Country Name: Name of the country for which data has been collected.

Male Height in Cm: Height of the male population in centimeters.

Female Height in Cm: Height of the female population in centimeters.

Male Height in Ft: Height of the male population in feet.

Female Height in Ft: Height of the female population in feet.

Setting the DataFrame Index

Now, let’s set the data frame index.

We can see from our data that the first column, ‘Rank’, is unique for each country and starts from 1. We can make use of that and set the ‘Rank’ column as the index.

df.set_index('Rank', inplace=True)
df.index

Alternatively, you can set the index directly while reading the file by passing the index_col parameter. Let’s do that and see the dataframe once again.

df = pd.read_csv("C:/Users/Intel/Documents/Height of Male and Female by Country 2022.csv", index_col='Rank')
df.head()


The dataset looks a bit more organized now.
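
If you prefer not to modify the dataframe in place, set_index can also return a new dataframe, and reset_index turns the index back into a regular column. A minimal sketch on an illustrative sample:

```python
import pandas as pd

sample = pd.DataFrame({
    "Rank": [1, 2, 3],
    "Country Name": ["Netherlands", "Montenegro", "Estonia"],
    "Male Height in Cm": [183.78, 183.30, 182.79],
})

# Without inplace=True, set_index returns a new DataFrame
# and leaves the original untouched.
indexed = sample.set_index("Rank")
print(indexed.index.name)
print(indexed.loc[2, "Country Name"])  # look up a row by its Rank label

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
print(restored.columns.tolist())
```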

Rows and Columns

You already know that dataframes have rows and columns. The columns in the dataframe can be easily accessed with the following commands:

df.columns
Index(['Country Name', 'Male Height in Cm', 'Female Height in Cm',
       'Male Height in Ft', 'Female Height in Ft'],
      dtype='object')
df['Country Name'].head()
Rank
1               Netherlands
2                Montenegro
3                   Estonia
4    Bosnia and Herzegovina
5                   Iceland
Name: Country Name, dtype: object

We can also rename our columns with the following command:

df.rename(columns={'Male Height in Cm': 'Male Height in Centimeter'}, inplace=True)
df.head()

You can also add columns to your data frame. Let’s take a look at how we can do that.

df_copy = df.copy()
df_copy['Height Ratio'] = 'N'
df_copy.head()

We have assigned the value “N” to every row of the new column.
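
A new column does not have to hold a constant. As an assumption about what a column named ‘Height Ratio’ might contain, the sketch below computes the ratio of male to female height from the existing columns; the sample rows are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Country Name": ["Netherlands", "Montenegro", "Estonia"],
    "Male Height in Centimeter": [183.78, 183.30, 182.79],
    "Female Height in Cm": [170.36, 169.96, 168.66],
})

# Work on a copy so the original dataframe stays unchanged.
sample_copy = sample.copy()

# Arithmetic between two columns is applied row by row.
sample_copy["Height Ratio"] = (
    sample_copy["Male Height in Centimeter"] / sample_copy["Female Height in Cm"]
)
print(sample_copy)
```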

Let’s imagine you have another dataframe that you want to append to the existing DataFrame (df_copy). Older versions of Pandas offered a DataFrame.append method for this, but it was deprecated in Pandas 1.4 and removed in Pandas 2.0, so we use the pd.concat function instead. Note that the numerical values are entered as numbers rather than strings so that the column types stay numeric.

data_to_append = {'Country Name': ['X', 'Y'],
                  'Male Height in Centimeter': [172.43, 188.94],
                  'Female Height in Cm': [150.99, 160.99],
                  'Male Height in Ft': [6.09, 5.44],
                  'Female Height in Ft': [5.66, 6.66],
                  'Height Ratio': ['Y', 'N']}

df_append = pd.DataFrame(data_to_append)
df_append
df_copy = pd.concat([df_copy, df_append], ignore_index=True)
df_copy.tail()

We can use the drop function to remove rows and columns from our dataframe.

For removing rows, you should use the following code:

df_copy.drop(labels=179, axis=0, inplace=True)

For removing columns, the following code will work:

df_copy.drop(labels='Height Ratio', axis=1, inplace=True)
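
As with other methods, drop returns a new dataframe when inplace=True is left out, which makes it easy to keep the original around. A minimal sketch on an illustrative sample; the columns keyword is an alternative to labels with axis=1:

```python
import pandas as pd

sample = pd.DataFrame({
    "Country Name": ["Netherlands", "Montenegro", "Estonia"],
    "Male Height in Centimeter": [183.78, 183.30, 182.79],
    "Height Ratio": ["N", "N", "N"],
})

# Drop the row with index label 1; the original is not modified.
without_row = sample.drop(labels=1, axis=0)

# Drop a column by name.
without_column = sample.drop(columns="Height Ratio")

print(without_row.index.tolist())
print(without_column.columns.tolist())
```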

Filtering the Data

We can also select the specific data we need. We will use two of the simplest indexers, loc and iloc, to select the data.

For example:

Here we use loc to access rows and columns by their labels (index values).

df.loc[193]
Country Name                  Nepal
Male Height in Centimeter    164.36
Female Height in Cm          152.39
Male Height in Ft              5.39
Female Height in Ft               5
Name: 193, dtype: object

You can also select specific columns for a row using the following code.

df.loc[193, ['Country Name', 'Male Height in Centimeter','Female Height in Cm']]
Country Name                  Nepal
Male Height in Centimeter    164.36
Female Height in Cm          152.39
Name: 193, dtype: object

Now, if you want to see the countries where the male height is 170 cm or more, you can pass a condition to loc.

df.loc[df['Male Height in Centimeter'] >= 170]
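
Conditions can also be combined. The sketch below assumes a second, purely illustrative threshold of 168 cm for female height and joins the two boolean masks with the & operator; the sample rows come from outputs shown in this article.

```python
import pandas as pd

sample = pd.DataFrame({
    "Country Name": ["Netherlands", "Bosnia and Herzegovina", "Nepal", "Timor-Leste"],
    "Male Height in Centimeter": [183.78, 182.47, 164.36, 160.13],
    "Female Height in Cm": [170.36, 167.47, 152.39, 152.71],
})

# Each comparison produces a boolean Series (a mask).
tall_male = sample["Male Height in Centimeter"] >= 170
tall_female = sample["Female Height in Cm"] >= 168  # illustrative threshold

# Combine masks with & (and) or | (or), not with the Python keywords.
both = sample.loc[tall_male & tall_female]

print(sample.loc[tall_male, "Country Name"].tolist())
print(both["Country Name"].tolist())
```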

If you want to select the value in the first row and first column, you can use iloc. iloc selects data based on integer position or a boolean array.

df.iloc[0,0]
'Netherlands'

You can also select an entire row. Since positions start at 0, position 10 is the eleventh row, which has Rank 11.

df.iloc[10,:]
Country Name                 Ukraine
Male Height in Centimeter     180.98
Female Height in Cm           166.62
Male Height in Ft               5.94
Female Height in Ft             5.47
Name: 11, dtype: object

We can also select an entire column. In this case, we have selected the last column.

df.iloc[:,-1]
Rank
1      5.59
2      5.58
3      5.53
4      5.49
5      5.54
       ... 
195    5.10
196    5.15
197    5.14
198    5.02
199    5.01
Name: Female Height in Ft, Length: 199, dtype: float64

You can also select multiple rows and columns.

df.iloc[100:199, 2:5]
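
One difference that is easy to miss: slices with loc include the end label, while slices with iloc exclude the end position, like ordinary Python slicing. A small sketch on an illustrative sample indexed by Rank:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Country Name": ["Netherlands", "Montenegro", "Estonia",
                         "Bosnia and Herzegovina", "Iceland"],
        "Male Height in Centimeter": [183.78, 183.30, 182.79, 182.47, 182.10],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Rank"),
)

by_label = sample.loc[1:3]      # labels 1, 2 and 3 (end label included)
by_position = sample.iloc[1:3]  # positions 1 and 2 (end position excluded)

print(by_label.index.tolist())
print(by_position.index.tolist())
```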

In the next section, we will learn how to look for missing data.

Working with Missing Values

The first step in identifying missing values in the dataframe is to use the isnull method.

df.isnull()

We can see that the output is an object with the same dimensions as the original DataFrame, holding a boolean value for every element of the dataset.

Missing values are marked as True and all other values as False. The displayed rows contain no missing values, but since only part of the dataframe is shown, we will run another check that counts the missing values in each column.

df.isnull().sum()
Country Name                 0
Male Height in Centimeter    0
Female Height in Cm          0
Male Height in Ft            0
Female Height in Ft          0
dtype: int64

Let’s check the proportion of missing values for each column.

df.isnull().sum() / df.shape[0]
Country Name                 0.0
Male Height in Centimeter    0.0
Female Height in Cm          0.0
Male Height in Ft            0.0
Female Height in Ft          0.0
dtype: float64

We can see that the proportion of missing values is zero for all the columns.
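
Our dataset happens to be complete, so here is a sketch of what these checks look like when values are missing. The three rows below are made up for illustration (they are not part of the dataset), and filling with the column mean is just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with two missing values (NaN).
sample = pd.DataFrame({
    "Country Name": ["X", "Y", "Z"],
    "Male Height in Centimeter": [172.43, np.nan, 188.94],
    "Female Height in Cm": [150.99, 160.99, np.nan],
})

missing_counts = sample.isnull().sum()  # missing values per column
print(missing_counts)

# Option 1: drop every row that contains a missing value.
complete_rows = sample.dropna()

# Option 2: fill the gaps, here with the mean of the column.
male_mean = sample["Male Height in Centimeter"].mean()  # NaN is skipped
filled = sample.fillna({"Male Height in Centimeter": male_mean})

print(len(complete_rows))
print(filled["Male Height in Centimeter"].tolist())
```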

Plotting the Data

Visualization is an important part of any data analysis project. In this part, we will learn how we can use Pandas to visualize our data. We will use the plot function in Pandas to build the plots.

Note: There are other Python libraries that offer more control over data visualization. If you would like more detailed and elaborate plots, you can use the Matplotlib and Seaborn libraries.

Histograms

A histogram helps you to quickly understand and visualize the distribution of numerical variables within your dataset. A histogram divides the values of a numerical variable into bins and counts the number of observations that fall into each bin. Histograms show how the data is distributed and give you an immediate intuition about it.

In the following example, we have plotted a histogram for the feature “male height in centimeters.”

df['Male Height in Centimeter'].plot(kind='hist')

You can see from the histogram that most of the male heights in the dataset fall between 175 cm and 180 cm.
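
To make the binning and counting concrete, here is a sketch that computes the counts without drawing anything, using numpy.histogram. The nine heights are a small sample taken from outputs in this article and the bin edges are chosen for illustration, so the counts do not reflect the full dataset.

```python
import numpy as np
import pandas as pd

# A few male heights (in cm) from the top and bottom of the ranking.
heights = pd.Series([183.78, 183.30, 182.79, 182.47, 182.10,
                     180.98, 164.36, 162.78, 160.13])

# Illustrative bin edges: five bins, each 5 cm wide.
bin_edges = [160, 165, 170, 175, 180, 185]

# counts[i] is the number of values between bin_edges[i] and bin_edges[i + 1].
counts, edges = np.histogram(heights, bins=bin_edges)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right} cm: {count}")
```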

Scatter Plots

Scatter plots help you to visualize the relationship between two variables. The plot is built on Cartesian coordinates. A scatter plot displays the data as a collection of points, where the value of one variable determines each point’s position on the X-axis and the value of the other variable determines its position on the Y-axis.

In the following example, we have built a scatter plot to understand the relationship between the two variables, i.e., male height and female height.

df.plot(x='Male Height in Centimeter', y='Female Height in Cm', kind='scatter')
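
A scatter plot shows the relationship visually; a correlation coefficient is one way to put a number on it. The sketch below computes the Pearson correlation on ten rows taken from the first and last rows of the dataset shown earlier, so the result describes this sample only, not the full dataset.

```python
import pandas as pd

# First five and last five rows of the dataset, as shown earlier.
sample = pd.DataFrame({
    "Male Height in Centimeter": [183.78, 183.30, 182.79, 182.47, 182.10,
                                  164.30, 163.10, 163.07, 162.78, 160.13],
    "Female Height in Cm": [170.36, 169.96, 168.66, 167.47, 168.91,
                            155.42, 156.89, 156.79, 153.10, 152.71],
})

# Series.corr computes the Pearson correlation by default.
# Values near 1 indicate a strong positive linear relationship.
r = sample["Male Height in Centimeter"].corr(sample["Female Height in Cm"])
print(round(r, 3))
```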

Conclusion

In this article, we learned a lot about hands-on data analysis in Python using Pandas, and I think that will help you understand what you can do with Pandas. Nowadays, Pandas is a widely used tool in data science and is often used in place of Excel for data work. Pandas makes data analysis a lot easier with its simple syntax and flexibility. Hope you had fun with Pandas!