
Python Pandas Module Tutorial

Python Pandas Module

Pandas is an open-source Python library that is widely used for data analysis, statistics, and general-purpose computing on tabular data.

Pandas builds on the core functionality of the NumPy module.

Before proceeding with this tutorial, I would therefore advise readers to get a basic understanding of the Python NumPy module.

Once you are done with that, let's dive in and start learning one of the most useful and interesting modules: Pandas.


Getting started with Python Pandas Module

Before exploring the functions of the Pandas module, we need to install it. Check the official Pandas documentation to confirm that the version you wish to install is compatible with your version of Python.

There are various ways to install the Python Pandas module. One of the easiest is to use pip, the Python package installer.

Type the following command in your command prompt:

pip install pandas

To use the Pandas and NumPy modules in our code, we need to import them.

import pandas
import numpy
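Many code bases import the two modules under the short aliases pd and np. The following is a minimal sketch, assuming both modules are already installed, that imports them under these aliases and prints the installed versions to confirm that the imports work:

```python
import numpy as np
import pandas as pd

# Printing the versions is a quick way to confirm that both imports succeeded.
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
```

The rest of this tutorial mostly uses the full module names, but the two forms are interchangeable.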

Python Pandas Module – Data Structures

Pandas is built around the following data structures:

  • Series
  • DataFrame
  • Panel (removed from pandas in version 0.25; described here for reference only)

These data structures are built on top of NumPy arrays and add labeled axes to them.

1. Series

A Pandas Series is a 1-dimensional, array-like structure that holds homogeneous data, along with an index label for each element.

Note: The size of a Series is immutable, i.e. once set, it cannot be changed in place. The values in the Series, however, can be changed or manipulated.

Syntax:

pandas.Series(data, index, dtype, copy)
  • data: Takes input in various forms such as a list, a constant, a NumPy array, a dict, etc.
  • index: Index labels for the data.
  • dtype: The data type of the elements. If omitted, it is inferred from the data.
  • copy: Whether to copy the input data. The default value is False.

Example:

import pandas
import numpy
input = numpy.array(['John','Bran','Sam','Peter'])
series_data = pandas.Series(input,index=[10,11,12,13])
print(series_data)

In the above code snippet, we have provided the input using NumPy arrays and set the index values to the input data.

Output:

10     John
11     Bran
12     Sam
13     Peter
dtype: object
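A Series can also be built from a dict, in which case the keys become the index labels. The following is a small sketch, with made-up marks used purely for illustration, that also shows how a value can be changed in place and how elements can be read by label or by position:

```python
import pandas as pd

# Build a Series from a dict: the keys become the index labels.
# The marks are illustrative values, not data from a real source.
marks = pd.Series({"John": 44, "Bran": 48, "Sam": 99})

# Values are mutable: assign a new value through its label.
marks["Bran"] = 50

# Read one element by label and one by integer position.
first_by_label = marks["John"]
first_by_position = marks.iloc[0]

print(marks)
print(first_by_label, first_by_position)
```

Both lookups return the same element here, because "John" is the first label in the index.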

2. DataFrame

The Pandas module provides the DataFrame, a 2-dimensional structure resembling a 2-D array, in which the data is arranged in labeled rows and columns.

Note: The size of a DataFrame is mutable, i.e. rows and columns can be added or removed.

Syntax:

pandas.DataFrame(data, index, columns, dtype, copy)
  • data: Takes input in various forms such as a list, a Series, a NumPy array, a dict, another DataFrame, etc.
  • index: Row labels for the data.
  • columns: Column labels for the data.
  • dtype: A data type to force on the data. If omitted, it is inferred for each column.
  • copy: Whether to copy the input data. The default behavior depends on the input type and the pandas version.

Example:

import pandas
input = [['John','Pune'],['Bran','Mumbai'],['Peter','Delhi']]
data_frame = pandas.DataFrame(input,columns=['Name','City'],index=[1,2,3])
print(data_frame)

In the above code, we have provided the input as a list of lists, added the labels 'Name' and 'City' to the columns, and set the index values.

Output:

    Name    City
1   John    Pune
2   Bran    Mumbai
3   Peter   Delhi
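Because the size of a DataFrame is mutable, columns can be added after it has been created. The following is a minimal sketch that builds the same table from a dict of lists and then adds a column; the Marks values are hypothetical and chosen only for illustration:

```python
import pandas as pd

# Build the DataFrame from a dict: each key becomes a column label.
students = pd.DataFrame(
    {"Name": ["John", "Bran", "Peter"], "City": ["Pune", "Mumbai", "Delhi"]},
    index=[1, 2, 3],
)

# Add a new column in place (hypothetical values).
students["Marks"] = [44, 48, 75]

# Select a single column (a Series) and a single row by its index label.
cities = students["City"]
row_two = students.loc[2]

print(students)
print(row_two["City"])
```

Selecting with students["City"] returns a column, while students.loc[2] returns the row whose index label is 2.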

3. Panel

The Panel was a 3-dimensional data structure offered by older versions of Pandas. It was removed in pandas 0.25, so the syntax below works only on earlier releases. A Panel has 3 axes that serve the following functions:

  • items: (axis 0) Each item corresponds to a DataFrame contained in the Panel.
  • major_axis: (axis 1) It corresponds to the rows of each DataFrame.
  • minor_axis: (axis 2) It corresponds to the columns of each DataFrame.

Syntax:

pandas.Panel(input_data, items, major_axis, minor_axis, data_type, copy)
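Since Panel is unavailable in pandas 0.25 and later, 3-dimensional data is commonly represented there as a DataFrame with a MultiIndex. The following is a rough sketch of that substitute, not of the Panel API itself; the names term1 and term2 and all the marks are hypothetical:

```python
import pandas as pd

# Two small DataFrames that would have been the "items" of a Panel.
term1_marks = pd.DataFrame({"Marks": [44, 48]}, index=["John", "Bran"])
term2_marks = pd.DataFrame({"Marks": [50, 52]}, index=["John", "Bran"])

# Stack them under an outer index level; the dict keys label each item.
stacked = pd.concat(
    {"term1": term1_marks, "term2": term2_marks},
    names=["item", "student"],
)

# Pull one item back out, or a single value from a given item and row.
term1 = stacked.loc["term1"]
john_term2 = stacked.loc[("term2", "John"), "Marks"]

print(stacked)
print(john_term2)
```

The outer index level plays the role of the items axis, while the inner level and the columns correspond to major_axis and minor_axis.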

Importing data from CSV file to DataFrame

A Pandas DataFrame can also be built from a CSV file. A CSV (comma-separated values) file is a text file that stores one record per line, with the fields separated by commas.

The read_csv(file_name) function reads the data from the CSV file into a DataFrame.

Syntax:

pandas.read_csv(file_name)

Example:

import pandas as pd
data =  pd.read_csv('C:\\Users\\HP\\Desktop\\Book1.csv')
print(data)

Output:

    Name  Age
0   John  21
1   Bran  22
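The example above depends on a file that exists only on the author's machine. The following sketch reads the same two records from an in-memory string instead, so it runs anywhere; it also shows the usecols and index_col parameters of read_csv. The CSV text is reconstructed from the output above and is assumed to match the original file:

```python
import io

import pandas as pd

# CSV content assumed to match Book1.csv, based on the output shown above.
csv_text = "Name,Age\nJohn,21\nBran,22\n"

# read_csv accepts any file-like object, not only a path.
data = pd.read_csv(io.StringIO(csv_text))

# Read only selected columns.
names_only = pd.read_csv(io.StringIO(csv_text), usecols=["Name"])

# Use one of the columns as the row index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col="Name")

print(data)
print(indexed.loc["Bran", "Age"])
```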

Statistical analysis in Pandas

The Pandas module offers a large number of built-in methods that help with the statistical analysis of data.

The following are some of the most commonly used functions for statistical analysis in Pandas:

Method: Description
count(): Counts the number of non-empty observations
sum(): Returns the sum of the data elements
mean(): Returns the mean of the data elements
median(): Returns the median of the data elements
mode(): Returns the mode of the data elements
std(): Returns the standard deviation of the data elements
min(): Returns the minimum of the data elements
max(): Returns the maximum of the data elements
abs(): Returns the absolute value of each data element
prod(): Returns the product of the data values
cumsum(): Returns the cumulative sum of the data values
cumprod(): Returns the cumulative product of the data values
describe(): Displays a statistical summary of the data in one shot (count, mean, std, min, quartiles, and max)

To get started, let's create a DataFrame that we'll use throughout this section to understand the functions provided for statistical analysis.

import pandas
import numpy


input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}

#Creating a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame)

Output:

    Name     Marks      Roll_num
0   John     44         1
1   Bran     48         2
2   Caret    75         3
3   Joha     33         4
4   Sam      99         5

sum() function

import pandas
import numpy


input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}

#Create a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame.sum())

Output:

Name        JohnBranCaretJohaSam
Marks       299
Roll_num    15
dtype:      object

As seen above, the sum() function adds up the data of every column separately and concatenates the string values wherever they are found.

mean() function

import pandas
import numpy


input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}

#Create a DataFrame
data_frame = pandas.DataFrame(input)
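#numeric_only=True skips the string column 'Name'
print(data_frame.mean(numeric_only=True))

Output:

Marks     59.8
Roll_num  3.0
dtype:    float64

Unlike the sum() function, the mean() function cannot act on the strings found within the data. Passing numeric_only=True restricts it to the numeric columns; without it, pandas 2.0 and later raise a TypeError for this DataFrame.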

min() function

import pandas
import numpy

input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}

#Create a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame.min())

Output:

Name      Bran
Marks     33
Roll_num  1
dtype:    object

count()

import pandas
import numpy


input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}

#Create a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame.count())

Output:

Name        5
Marks       5
Roll_num    5
dtype:      int64

describe()

import pandas
import numpy


input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}

#Create a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame.describe())

Output:

        Marks      Roll_num
count   5.000000   5.000000
mean    59.800000  3.000000
std     26.808581  1.581139
min     33.000000  1.000000
25%     44.000000  2.000000
50%     48.000000  3.000000
75%     75.000000  4.000000
max     99.000000  5.000000
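The remaining functions from the list can be applied to a single column in the same way. The following is a short sketch that uses the Marks values from the DataFrame above to show median(), std(), cumsum() and abs():

```python
import pandas as pd

# The Marks column from the DataFrame used throughout this section.
marks = pd.Series([44, 48, 75, 33, 99])

median_marks = marks.median()   # middle value of the sorted data
std_marks = marks.std()         # sample standard deviation
running_total = marks.cumsum()  # cumulative sum, element by element

# abs() is useful after a subtraction, e.g. the distance of each
# value from the mean regardless of sign.
distance_from_mean = (marks - marks.mean()).abs()

print(median_marks)
print(std_marks)
print(running_total.tolist())
print(distance_from_mean.tolist())
```

The median and standard deviation agree with the 50% and std rows reported by describe() above.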

Iterating Data Frames in Pandas

Iterating directly over each of the three data structures yields the following:

  • Series: the values
  • DataFrame: the column labels
  • Panel: the item labels

The following functions can be used to iterate over a DataFrame:

  • items() − Iterates over the columns and yields (column label, Series) pairs. Older code uses iteritems(), which was removed in pandas 2.0
  • iterrows() − Iterates over the rows and yields (index, Series) pairs
  • itertuples() − Iterates over the rows and yields one namedtuple per row

Example:

import pandas
import numpy


input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}


data_frame = pandas.DataFrame(input)
#using the items() function (iteritems() was removed in pandas 2.0)
for key,value in data_frame.items():
   print(key,value)
print("\n")
#using the iterrows() function
for row_index,row in data_frame.iterrows():
   print(row_index,row)
print("\n")
#using the itertuples() function
for row in data_frame.itertuples():
    print(row)

Output:

Name 0     John
1     Bran
2    Caret
3     Joha
4      Sam
Name: Name, dtype: object
Marks 0    44
1    48
2    75
3    33
4    99
Name: Marks, dtype: int64
Roll_num 0    1
1    2
2    3
3    4
4    5
Name: Roll_num, dtype: int64

0 Name        John
Marks         44
Roll_num       1
Name: 0, dtype: object
1 Name        Bran
Marks         48
Roll_num       2
Name: 1, dtype: object
2 Name        Caret
Marks          75
Roll_num        3
Name: 2, dtype: object
3 Name        Joha
Marks         33
Roll_num       4
Name: 3, dtype: object
4 Name        Sam
Marks        99
Roll_num      5
Name: 4, dtype: object

Pandas(Index=0, Name='John', Marks=44, Roll_num=1)
Pandas(Index=1, Name='Bran', Marks=48, Roll_num=2)
Pandas(Index=2, Name='Caret', Marks=75, Roll_num=3)
Pandas(Index=3, Name='Joha', Marks=33, Roll_num=4)
Pandas(Index=4, Name='Sam', Marks=99, Roll_num=5)
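The namedtuples produced by itertuples() let us read each field by its column name. The following is a minimal sketch that collects the names of the students whose marks reach a pass mark; the threshold of 45 is a hypothetical value chosen for illustration:

```python
import pandas as pd

data_frame = pd.DataFrame({
    "Name": ["John", "Bran", "Caret", "Joha", "Sam"],
    "Marks": [44, 48, 75, 33, 99],
    "Roll_num": [1, 2, 3, 4, 5],
})

PASS_MARK = 45  # hypothetical threshold, not part of the original data

passed = []
# index=False leaves the row index out of each namedtuple.
for row in data_frame.itertuples(index=False):
    # Fields are available as attributes named after the columns.
    if row.Marks >= PASS_MARK:
        passed.append(row.Name)

print(passed)
```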

Sorting in Pandas

The following techniques are used to sort data in Pandas:

  • Sorting by label
  • Sorting by actual value

Sorting by label

The sort_index() method is used to sort the data based on the index values.

Example:

import pandas
import numpy


input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}


data_frame = pandas.DataFrame(input, index=[0,2,1,4,3])
print("Unsorted data frame:\n")
print(data_frame)
sorted_df=data_frame.sort_index()
print("Sorted data frame:\n")
print(sorted_df)

Output:

Unsorted data frame:

    Name  Marks  Roll_num
0   John     44         1
2   Caret    75         3
1   Bran     48         2
4   Sam      99         5
3   Joha     33         4

Sorted data frame:

    Name  Marks  Roll_num
0   John     44         1
1   Bran     48         2
2   Caret    75         3
3   Joha     33         4
4   Sam      99         5

Sorting by values

The sort_values() method is used to sort the DataFrame by values.

It accepts a 'by' parameter that takes the name of the column (or a list of column names) by which the values are to be sorted.

Example:

import pandas
import numpy


input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99]),
   'Roll_num':pandas.Series([1,2,3,4,5])
}


data_frame = pandas.DataFrame(input, index=[0,2,1,4,3])
print("Unsorted data frame:\n")
print(data_frame)
sorted_df=data_frame.sort_values(by='Marks')
print("Sorted data frame:\n")
print(sorted_df)

Output:

Unsorted data frame:

    Name  Marks  Roll_num
0   John     44         1
2   Caret    75         3
1   Bran     48         2
4   Sam      99         5
3   Joha     33         4

Sorted data frame:

    Name  Marks  Roll_num
3   Joha     33         4
0   John     44         1
1   Bran     48         2
2   Caret    75         3
4    Sam     99         5
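sort_values() sorts in ascending order by default. The following sketch, using the same data, shows sorting in descending order through the ascending parameter, sorting by more than one column, and renumbering the rows after a sort with reset_index():

```python
import pandas as pd

data_frame = pd.DataFrame({
    "Name": ["John", "Bran", "Caret", "Joha", "Sam"],
    "Marks": [44, 48, 75, 33, 99],
    "Roll_num": [1, 2, 3, 4, 5],
})

# Highest marks first.
by_marks_desc = data_frame.sort_values(by="Marks", ascending=False)

# Sort by Name first; Marks would break ties between equal names.
by_name_then_marks = data_frame.sort_values(by=["Name", "Marks"])

# Drop the old index labels and renumber the sorted rows from 0.
renumbered = by_marks_desc.reset_index(drop=True)

print(by_marks_desc)
print(by_name_then_marks)
print(renumbered)
```

Note that sort_values() returns a new DataFrame and leaves data_frame itself unchanged.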

Operations on Text data in Pandas

Python's string functions can be applied to the text data of a Series, or of a DataFrame column, through the .str accessor.

The following list contains the most commonly used string functions:

lower(): Converts the strings to lower case.
upper(): Converts the strings to upper case.
len(): Returns the length of each string.
strip(): Trims the white-space from both sides of each string.
split(' '): Splits each string on the given pattern.
contains(pattern): Returns True for each element that contains the given sub-string.
replace(x,y): Replaces occurrences of x with y.
startswith(pattern): Returns True for each element that begins with the given argument.
endswith(pattern): Returns True for each element that ends with the given argument.
swapcase(): Swaps upper case to lower case and vice versa.
islower(): Returns a boolean indicating whether all the characters of each element are in lower case.
isupper(): Returns a boolean indicating whether all the characters of each element are in upper case.

Example:

import pandas
import numpy


input = pandas.Series(['John','Bran','Caret','Joha','Sam'])
print("Converting the DataFrame to lower case....\n")
print(input.str.lower())
print("Converting the DataFrame to Upper Case.....\n")
print(input.str.upper())
print("Displaying the length of data element in each row.....\n")
print(input.str.len())
print("Replacing 'a' with '@'.....\n")
print(input.str.replace('a','@'))

Output:

Converting the DataFrame to lower case....

0     john
1     bran
2     caret
3     joha
4     sam
dtype: object

Converting the DataFrame to Upper Case.....

0     JOHN
1     BRAN
2     CARET
3     JOHA
4     SAM
dtype: object

Displaying the length of data element in each row.....

0    4
1    4
2    5
3    4
4    3
dtype: int64

Replacing 'a' with '@'.....

0     John
1     Br@n
2     C@ret
3     Joh@
4     S@m
dtype: object
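Some of the other functions from the list work in the same way. The following is a small sketch of strip(), contains(), startswith(), swapcase() and split(); the extra spaces around the first name and the full name "John Smith" are made up for illustration:

```python
import pandas as pd

# The first entry has stray white-space added on purpose.
names = pd.Series(["  John ", "Bran", "Caret", "Joha", "Sam"])

clean = names.str.strip()              # remove surrounding white-space
has_a = clean.str.contains("a")        # True where the name contains 'a'
starts_j = clean.str.startswith("J")   # True where the name starts with 'J'
swapped = clean.str.swapcase()         # swap upper and lower case

# split() returns a list of parts for every element.
parts = pd.Series(["John Smith"]).str.split(" ")

print(clean.tolist())
print(has_a.tolist())
print(starts_j.tolist())
print(swapped.tolist())
print(parts.iloc[0])
```

The boolean results of contains() and startswith() can be used directly to filter a Series, for example clean[starts_j].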

Data Wrangling in Python Pandas Module

Data wrangling is the processing and manipulation of data to prepare it for analysis.

The following functions enable data wrangling in the Python Pandas module:

  • merge(): Joins two DataFrames on the values of one or more common columns.
  • groupby(): Splits the data into groups based on the values of the given column(s).
  • concat(): Appends one DataFrame to another along an axis.

Example:

import pandas
import numpy


input1 = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99])}
input2 = {'Name':pandas.Series(['John','Shaun','Jim','Gifty']),
   'Marks':pandas.Series([44,45,78,99])}

#Create a DataFrame
df1 = pandas.DataFrame(input1)
df2 = pandas.DataFrame(input2)
print("DataFrame 1:\n")
print(df1)
print("DataFrame 2:\n")
print(df2)
print("Merging the DataFrames..\n")
print(pandas.merge(df1, df2, on='Marks'))
print("Grouping the DataFrame..\n")
group_by = df2.groupby('Name')
print(group_by.get_group('John'))
print("Concatenating both the DataFrames..\n")
print(pandas.concat([df1, df2]))

Output:

DataFrame 1:

    Name  Marks
0   John     44
1   Bran     48
2  Caret     75
3   Joha     33
4    Sam     99

DataFrame 2:

    Name  Marks
0   John     44
1  Shaun     45
2    Jim     78
3  Gifty     99

Merging the DataFrames..

  Name_x  Marks Name_y
0   John     44   John
1    Sam     99  Gifty

Grouping the DataFrame..

   Name  Marks
0  John     44

Concatenating both the DataFrames..

    Name  Marks
0   John     44
1   Bran     48
2  Caret     75
3   Joha     33
4    Sam     99
0   John     44
1  Shaun     45
2    Jim     78
3  Gifty     99
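A few parameters change the behavior seen above. The following sketch, using the same two DataFrames, shows an outer merge through the how parameter (which keeps the rows without a match), concatenation with ignore_index=True (which avoids the repeated index labels), and an aggregate computed per group:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "Bran", "Caret", "Joha", "Sam"],
    "Marks": [44, 48, 75, 33, 99],
})
df2 = pd.DataFrame({
    "Name": ["John", "Shaun", "Jim", "Gifty"],
    "Marks": [44, 45, 78, 99],
})

# how='outer' keeps every Marks value from both frames;
# unmatched names are filled with NaN.
outer = pd.merge(df1, df2, on="Marks", how="outer")

# ignore_index=True renumbers the combined rows from 0.
combined = pd.concat([df1, df2], ignore_index=True)

# Aggregate each group: mean marks per name.
mean_by_name = combined.groupby("Name")["Marks"].mean()

print(outer)
print(combined)
print(mean_by_name)
```

The default for merge() is how='inner', which is why only the two matching rows appeared in the output above.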

Data Visualization in Pandas

The data obtained as output can be understood more easily by plotting it.

To plot and present the data, we first need to install the matplotlib library.

pip install matplotlib

Example: Data Visualization

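import pandas
import matplotlib.pyplot as plt

input1 = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
   'Marks':pandas.Series([44,48,75,33,99])}
df1 = pandas.DataFrame(input1)
#Draw a bar chart of the numeric column (Marks) against the row index
df1.plot.bar()
#Display the figure when the code is run as a script
plt.show()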

Output:

[Figure: Data Visualization]
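The bars can also be labeled with the names instead of the row index. The following is a minimal sketch, assuming matplotlib is installed, that passes the x and y parameters to plot.bar() and renders the chart into an in-memory PNG. The Agg backend is selected here only so that the code runs without a display:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened

import pandas as pd

df1 = pd.DataFrame({
    "Name": ["John", "Bran", "Caret", "Joha", "Sam"],
    "Marks": [44, 48, 75, 33, 99],
})

# Use the Name column for the x-axis labels and Marks for the bar heights.
ax = df1.plot.bar(x="Name", y="Marks", legend=False)
ax.set_ylabel("Marks")

# Render the figure as PNG bytes; pass a file name instead to save it to disk.
buffer = io.BytesIO()
ax.get_figure().savefig(buffer, format="png")

print(len(ax.patches), "bars drawn")
```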

Conclusion

In this tutorial, we have explored the main data structures, methods, and functions available within the Python Pandas module.

