Pandas is an open-source Python module widely used for data analysis, statistics, and general data manipulation.
Pandas is built on top of the NumPy module.
Before proceeding with this tutorial, I would advise readers to get a basic understanding of the Python NumPy module.
Once you are done with it, let's dive in and get started with one of the most useful and interesting modules: Pandas.
Getting started with Python Pandas Module
Before exploring the functions of the Pandas module, we need to install it. Check the official Pandas documentation to confirm that the version you wish to install is compatible with your version of Python.
There are various ways to install the Python Pandas module. One of the easiest is to use the Python package installer, pip.
Type the following command in your command prompt:
pip install pandas
To use the Pandas and NumPy modules, we need to import them in our code.
import pandas
import numpy
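A common convention, which the CSV example later in this tutorial also follows, is to import the modules under the short aliases pd and np. The following is a minimal sketch that also prints the installed versions, which can help when checking compatibility:

```python
import numpy as np
import pandas as pd

# Both modules expose their version as a string attribute.
print("pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
```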
Python Pandas Module – Data Structures
Pandas is built around the following data structures:
- Series
- DataFrame
- Panel (available only in older releases; it was removed in pandas 0.25)
These data structures are built on top of NumPy arrays and add labelled axes to them.
1. Series
A Pandas Series is a one-dimensional labelled array that holds data of a single type (dtype). It stores its elements along one dimension, each paired with an index label.
Note: The size of a Series is immutable, i.e. once created, its length cannot be changed in place, whereas the values in the Series can be changed.
Syntax:
pandas.Series(data, index, dtype, copy)
- data: Takes input in various forms such as a list, a constant, a NumPy array, a dict, etc.
- index: Index labels assigned to the data.
- dtype: The data type of the Series. If omitted, it is inferred from the data.
- copy: Whether to copy the input data. Copying is off by default in most pandas releases.
Example:
import pandas
import numpy
input = numpy.array(['John','Bran','Sam','Peter'])
series_data = pandas.Series(input,index=[10,11,12,13])
print(series_data)
In the above code snippet, we have provided the input as a NumPy array and set the index values for the input data.
Output:
10 John
11 Bran
12 Sam
13 Peter
dtype: object
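A Series can also be built from a dict, in which case the keys become the index labels. The sketch below uses illustrative data that is not part of the example above, and also shows that values can be changed in place while the length stays the same:

```python
import pandas as pd

# Keys of the dict become the index labels (illustrative data).
marks = pd.Series({'John': 44, 'Bran': 48, 'Sam': 99})

# Values can be changed in place by label.
marks['Bran'] = 50

print(marks)
print(marks['Sam'])  # look up a single value by its label
```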
2. DataFrame
The Pandas module provides the DataFrame, a two-dimensional labelled structure resembling a 2-D array or a table. The input data is arranged in rows and columns, and each column can hold a different data type.
Note: The size of a DataFrame is mutable; rows and columns can be added or removed.
Syntax:
pandas.DataFrame(data, index, columns, dtype, copy)
- data: Takes input in various forms such as a list, a Series, a NumPy array, a dict, another DataFrame, etc.
- index: Index labels assigned to the rows.
- columns: Labels assigned to the columns.
- dtype: A data type to force on the data. If omitted, the type of each column is inferred.
- copy: Whether to copy the input data.
Example:
import pandas
input = [['John','Pune'],['Bran','Mumbai'],['Peter','Delhi']]
data_frame = pandas.DataFrame(input,columns=['Name','City'],index=[1,2,3])
print(data_frame)
In the above code, we have provided the input as a list of lists, added the labels 'Name' and 'City' to the columns, and set the index values.
Output:
Name City
1 John Pune
2 Bran Mumbai
3 Peter Delhi
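The same DataFrame can be built from a dict of lists, and because the size of a DataFrame is mutable, a column can be added afterwards. This is a minimal sketch; the 'Age' column and its values are made up for illustration:

```python
import pandas as pd

# Each dict key becomes a column label.
df = pd.DataFrame({'Name': ['John', 'Bran', 'Peter'],
                   'City': ['Pune', 'Mumbai', 'Delhi']},
                  index=[1, 2, 3])

# Adding a column changes the shape of the DataFrame (hypothetical values).
df['Age'] = [21, 22, 23]

print(df.shape)
print(df.loc[2, 'City'])  # select one cell by row label and column label
```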
3. Panel
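Older versions of the Pandas module offer a Panel, a 3-dimensional data structure with 3 axes that serve the following functions. Panel was deprecated in pandas 0.20 and removed in 0.25, so this section applies only to legacy code: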
- items: (axis 0) Every item of it corresponds to a DataFrame in it.
- major_axis: (axis 1) It corresponds to the rows of each DataFrame.
- minor_axis: (axis 2) It corresponds to the columns of each DataFrame.
Syntax:
pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)
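Since Panel is no longer available in pandas 0.25 and later, 3-dimensional data is usually represented there as a DataFrame with a MultiIndex. The following is a sketch of that alternative, not a drop-in replacement; the labels 'term1', 'term2', 'item', and 'student' are hypothetical:

```python
import pandas as pd

# Two "items", each a small table of rows x columns (illustrative data).
term1 = pd.DataFrame({'Marks': [48, 44], 'Roll_num': [2, 1]}, index=['Bran', 'John'])
term2 = pd.DataFrame({'Marks': [52, 50], 'Roll_num': [2, 1]}, index=['Bran', 'John'])

# Stacking with keys adds an outer index level that plays the role of the items axis.
stacked = pd.concat([term1, term2], keys=['term1', 'term2'], names=['item', 'student'])

print(stacked)
print(stacked.loc['term2'])                     # one "item" as a plain DataFrame
print(stacked.loc[('term1', 'Bran'), 'Marks'])  # a single cell
```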
Importing data from CSV file to DataFrame
A DataFrame can also be built from a CSV file. A CSV file is a plain-text file that stores one record per line, with the fields of each record separated by commas.
The read_csv() function reads the data from a CSV file into a DataFrame. Its first argument is the path of the file to read.
Syntax:
pandas.read_csv(filepath_or_buffer)
Example:
import pandas as pd
data = pd.read_csv('C:\\Users\\HP\\Desktop\\Book1.csv')
print(data)
Output:
Name Age
0 John 21
1 Bran 22
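To try read_csv() without a file on disk, the CSV text can be wrapped in io.StringIO, which read_csv() accepts in place of a path. The sketch below assumes contents that mirror the output above and shows two optional parameters, usecols and index_col:

```python
import io
import pandas as pd

# Stand-in for a file on disk (assumed contents).
csv_text = "Name,Age\nJohn,21\nBran,22\n"

data = pd.read_csv(io.StringIO(csv_text))
print(data)

# usecols limits the columns that are read.
names_only = pd.read_csv(io.StringIO(csv_text), usecols=['Name'])
print(names_only)

# index_col uses one of the columns as the row labels.
by_name = pd.read_csv(io.StringIO(csv_text), index_col='Name')
print(by_name.loc['Bran', 'Age'])
```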
Statistical analysis in Pandas
The Pandas module provides a large number of built-in methods for the statistical analysis of data.
The following table lists some of the most commonly used methods for statistical analysis in pandas:
Method | Description |
---|---|
count() | Counts the number of non-null observations |
sum() | Returns the sum of the data elements |
mean() | Returns the mean of all the data elements |
median() | Returns the median of all the data elements |
mode() | Returns the mode of all the data elements |
std() | Returns the Standard deviation of all the data elements |
min() | Returns the minimum data element among all the input elements. |
max() | Returns the maximum data element among all the input elements. |
abs() | Returns the absolute value |
prod() | Returns the product of data values |
cumsum() | Returns the cumulative sum of the data values |
cumprod() | Returns the cumulative product of the data values |
describe() | Displays a statistical summary of the numeric columns in one shot, i.e. count, mean, std, min, quartiles, and max |
To get started, let's create a DataFrame that we'll use throughout this section to understand the various functions provided for statistical analysis.
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
#Creating a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame)
Output:
Name Marks Roll_num
0 John 44 1
1 Bran 48 2
2 Caret 75 3
3 Joha 33 4
4 Sam 99 5
sum() function
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
#Create a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame.sum())
Output:
Name JohnBranCaretJohaSam
Marks 299
Roll_num 15
dtype: object
As seen above, the sum() function adds the data of every column separately and concatenates the string values wherever found.
mean() function
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
#Create a DataFrame
data_frame = pandas.DataFrame(input)
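#numeric_only=True skips the string column 'Name'
print(data_frame.mean(numeric_only=True))
Output:
Marks 59.8
Roll_num 3.0
dtype: float64
Unlike the sum() function, the mean() function cannot act on the strings found within the data. Passing numeric_only=True restricts the calculation to the numeric columns; in pandas 2.0 and later, omitting it raises a TypeError for this DataFrame.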
min() function
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
#Create a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame.min())
Output:
Name Bran
Marks 33
Roll_num 1
dtype: object
count() function
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
#Create a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame.count())
Output:
Name 5
Marks 5
Roll_num 5
dtype: int64
describe() function
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
#Create a DataFrame
data_frame = pandas.DataFrame(input)
print(data_frame.describe())
Output:
Marks Roll_num
count 5.000000 5.000000
mean 59.800000 3.000000
std 26.808581 1.581139
min 33.000000 1.000000
25% 44.000000 2.000000
50% 48.000000 3.000000
75% 75.000000 4.000000
max 99.000000 5.000000
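Several methods from the table above are not demonstrated in this section. The following is a brief sketch of a few of them on the same Marks values, assuming a plain Series rather than the full DataFrame:

```python
import pandas as pd

marks = pd.Series([44, 48, 75, 33, 99])

print(marks.median())           # middle value of the sorted data
print(marks.std())              # sample standard deviation (ddof=1 by default)
print(marks.max())              # largest value
print(marks.prod())             # product of all values
print(marks.cumsum().tolist())  # running total
```

The std() result agrees with the std row that describe() reports for the Marks column.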
Iterating Data Frames in Pandas
Iterating directly over each of the three data structures yields the following:
- Series: the values
- DataFrame: the column labels
- Panel: the item labels
The following functions can be used to iterate over a DataFrame:
- items() (named iteritems() before pandas 2.0) − Iterates over the columns and yields (column label, Series) pairs
- iterrows() − Iterates over the rows and yields (index, Series) pairs
- itertuples() − Iterates over the rows and yields named tuples
Example:
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
data_frame = pandas.DataFrame(input)
#using the items() function (iteritems() was removed in pandas 2.0)
for key,value in data_frame.items():
    print(key,value)
print("\n")
#using the iterrows() function
for row_index,row in data_frame.iterrows():
    print(row_index,row)
print("\n")
#using the itertuples() function
for row in data_frame.itertuples():
    print(row)
Output:
Name 0 John
1 Bran
2 Caret
3 Joha
4 Sam
Name: Name, dtype: object
Marks 0 44
1 48
2 75
3 33
4 99
Name: Marks, dtype: int64
Roll_num 0 1
1 2
2 3
3 4
4 5
Name: Roll_num, dtype: int64
0 Name John
Marks 44
Roll_num 1
Name: 0, dtype: object
1 Name Bran
Marks 48
Roll_num 2
Name: 1, dtype: object
2 Name Caret
Marks 75
Roll_num 3
Name: 2, dtype: object
3 Name Joha
Marks 33
Roll_num 4
Name: 3, dtype: object
4 Name Sam
Marks 99
Roll_num 5
Name: 4, dtype: object
Pandas(Index=0, Name='John', Marks=44, Roll_num=1)
Pandas(Index=1, Name='Bran', Marks=48, Roll_num=2)
Pandas(Index=2, Name='Caret', Marks=75, Roll_num=3)
Pandas(Index=3, Name='Joha', Marks=33, Roll_num=4)
Pandas(Index=4, Name='Sam', Marks=99, Roll_num=5)
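The fields of each named tuple returned by itertuples() can be read as attributes. The sketch below uses that to collect names above a threshold, and then shows the same selection with a boolean mask, which avoids the explicit loop. The pass mark of 45 is an arbitrary value chosen for illustration:

```python
import pandas as pd

data_frame = pd.DataFrame({'Name': ['John', 'Bran', 'Caret'],
                           'Marks': [44, 48, 75]})

# Fields of each named tuple are available as attributes.
passed = []
for row in data_frame.itertuples(index=False):
    if row.Marks >= 45:  # hypothetical pass mark
        passed.append(row.Name)
print(passed)

# The same selection without an explicit loop, using a boolean mask.
passed_vectorized = data_frame.loc[data_frame['Marks'] >= 45, 'Name'].tolist()
print(passed_vectorized)
```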
Sorting in Pandas
Data in Pandas can be sorted in the following ways:
- Sorting by label
- Sorting by actual value
Sorting by label
The sort_index() method is used to sort the data based on the index values.
Example:
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
data_frame = pandas.DataFrame(input, index=[0,2,1,4,3])
print("Unsorted data frame:\n")
print(data_frame)
sorted_df=data_frame.sort_index()
print("Sorted data frame:\n")
print(sorted_df)
Output:
Unsorted data frame:
Name Marks Roll_num
0 John 44 1
2 Caret 75 3
1 Bran 48 2
4 Sam 99 5
3 Joha 33 4
Sorted data frame:
Name Marks Roll_num
0 John 44 1
1 Bran 48 2
2 Caret 75 3
3 Joha 33 4
4 Sam 99 5
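sort_index() also accepts the ascending and axis parameters. A short sketch on a smaller, illustrative DataFrame:

```python
import pandas as pd

data_frame = pd.DataFrame({'Name': ['John', 'Caret', 'Bran'],
                           'Marks': [44, 75, 48]},
                          index=[0, 2, 1])

# ascending=False sorts the row labels from highest to lowest.
descending = data_frame.sort_index(ascending=False)
print(descending)

# axis=1 sorts the column labels instead of the row labels.
by_column = data_frame.sort_index(axis=1)
print(by_column)
```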
Sorting by values
The sort_values() method is used to sort the DataFrame by values.
It accepts a 'by' parameter that takes the name of the column (or a list of column names) by which the values need to be sorted.
Example:
import pandas
import numpy
input = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99]),
'Roll_num':pandas.Series([1,2,3,4,5])
}
data_frame = pandas.DataFrame(input, index=[0,2,1,4,3])
print("Unsorted data frame:\n")
print(data_frame)
sorted_df=data_frame.sort_values(by='Marks')
print("Sorted data frame:\n")
print(sorted_df)
Output:
Unsorted data frame:
Name Marks Roll_num
0 John 44 1
2 Caret 75 3
1 Bran 48 2
4 Sam 99 5
3 Joha 33 4
Sorted data frame:
Name Marks Roll_num
3 Joha 33 4
0 John 44 1
1 Bran 48 2
2 Caret 75 3
4 Sam 99 5
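sort_values() can also sort in descending order and break ties using a second column. The sketch below uses illustrative data in which two rows share the same Marks value:

```python
import pandas as pd

data_frame = pd.DataFrame({'Name': ['John', 'Bran', 'Caret', 'Joha'],
                           'Marks': [44, 48, 44, 33]})

# Highest marks first.
top_first = data_frame.sort_values(by='Marks', ascending=False)
print(top_first)

# Rows with equal Marks are ordered by Name.
tie_broken = data_frame.sort_values(by=['Marks', 'Name'])
print(tie_broken)
```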
Operations on Text data in Pandas
Python-style string functions can be applied to a Series, or to a single column of a DataFrame, through the .str accessor.
The following table lists the most commonly used string functions:
Function | Description |
---|---|
lower() | Converts each string to lower case. |
upper() | Converts each string to upper case. |
len() | Returns the length of each string. |
strip() | Trims the whitespace from both sides of each string. |
split(' ') | Splits each string around the given separator. |
contains(pattern) | Returns True for each element that contains the given sub-string or pattern. |
replace(x, y) | Replaces occurrences of x with y. |
startswith(pattern) | Returns True for each element that begins with the given argument. |
endswith(pattern) | Returns True for each element that ends with the given argument. |
swapcase() | Swaps upper case to lower case and vice versa. |
islower() | Returns a boolean indicating whether all the characters of each element are in lower case. |
isupper() | Returns a boolean indicating whether all the characters of each element are in upper case. |
Example:
import pandas
import numpy
input = pandas.Series(['John','Bran','Caret','Joha','Sam'])
print("Converting the DataFrame to lower case....\n")
print(input.str.lower())
print("Converting the DataFrame to Upper Case.....\n")
print(input.str.upper())
print("Displaying the length of data element in each row.....\n")
print(input.str.len())
print("Replacing 'a' with '@'.....\n")
print(input.str.replace('a','@'))
Output:
Converting the DataFrame to lower case....
0 john
1 bran
2 caret
3 joha
4 sam
dtype: object
Converting the DataFrame to Upper Case.....
0 JOHN
1 BRAN
2 CARET
3 JOHA
4 SAM
dtype: object
Displaying the length of data element in each row.....
0 4
1 4
2 5
3 4
4 3
dtype: int64
Replacing 'a' with '@'.....
0 John
1 Br@n
2 C@ret
3 Joh@
4 S@m
dtype: object
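The remaining functions from the table are used the same way through the .str accessor. A small sketch with illustrative names, some of which carry extra whitespace:

```python
import pandas as pd

names = pd.Series(['  John ', 'Bran', 'caret lee'])

# Trim surrounding whitespace first so the later checks see clean values.
stripped = names.str.strip()
print(stripped.tolist())

print(stripped.str.contains('an').tolist())   # sub-string test per element
print(stripped.str.startswith('J').tolist())  # prefix test per element
print(stripped.str.split(' ').tolist())       # each element becomes a list
print(stripped.str.islower().tolist())        # True only if all cased characters are lower case
```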
Data Wrangling in Python Pandas Module
Data wrangling is the processing and manipulation of data.
The following functions enable data wrangling in the Python Pandas module:
- merge(): Joins two DataFrames on one or more common columns, keeping the rows whose key values match in both (an inner join by default).
- groupby(): Splits the data into groups based on the values of the given column, so that each group can be inspected or aggregated.
- concat(): Appends one DataFrame to another along an axis (rows by default).
Example:
import pandas
import numpy
input1 = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99])}
input2 = {'Name':pandas.Series(['John','Shaun','Jim','Gifty']),
'Marks':pandas.Series([44,45,78,99])}
#Create a DataFrame
df1 = pandas.DataFrame(input1)
df2 = pandas.DataFrame(input2)
print("DataFrame 1:\n")
print(df1)
print("DataFrame 2:\n")
print(df2)
print("Merging the DataFrames..\n")
print(pandas.merge(df1, df2, on='Marks'))
print("Grouping the DataFrame..\n")
group_by = df2.groupby('Name')
print(group_by.get_group('John'))
print("Concatenating both the DataFrames..\n")
print(pandas.concat([df1, df2]))
Output:
DataFrame 1:
Name Marks
0 John 44
1 Bran 48
2 Caret 75
3 Joha 33
4 Sam 99
DataFrame 2:
Name Marks
0 John 44
1 Shaun 45
2 Jim 78
3 Gifty 99
Merging the DataFrames..
Name_x Marks Name_y
0 John 44 John
1 Sam 99 Gifty
Grouping the DataFrame..
Name Marks
0 John 44
Concatenating both the DataFrames..
Name Marks
0 John 44
1 Bran 48
2 Caret 75
3 Joha 33
4 Sam 99
0 John 44
1 Shaun 45
2 Jim 78
3 Gifty 99
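Each of these functions takes further options. The sketch below, on smaller illustrative DataFrames, shows an outer merge on 'Name', a concatenation that renumbers the rows, and an aggregation after grouping. The suffixes '_left' and '_right' are arbitrary names chosen for this example:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Bran', 'Sam'], 'Marks': [44, 48, 99]})
df2 = pd.DataFrame({'Name': ['John', 'Shaun', 'Gifty'], 'Marks': [44, 45, 99]})

# how='outer' keeps rows from both sides; cells without a match become NaN.
outer = pd.merge(df1, df2, on='Name', how='outer', suffixes=('_left', '_right'))
print(outer)

# ignore_index=True renumbers the rows instead of repeating the original labels.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)

# Aggregate after grouping: mean marks per name across both DataFrames.
mean_marks = combined.groupby('Name')['Marks'].mean()
print(mean_marks)
```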
Data Visualization in Pandas
The data obtained as output can be visualized more clearly by plotting it.
To plot the data, we first need to install the matplotlib library, which Pandas uses for plotting.
pip install matplotlib
Example: Data Visualization
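import pandas
import matplotlib.pyplot as plt
input1 = {'Name':pandas.Series(['John','Bran','Caret','Joha','Sam']),
'Marks':pandas.Series([44,48,75,33,99])}
df1 = pandas.DataFrame(input1)
#Plot the numeric column (Marks) as a bar chart
df1.plot.bar()
#Display the chart when running the code as a script
plt.show()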
Output: a bar chart with one bar per row, showing the Marks value against the row index (0 to 4).
Conclusion
In this tutorial, we have covered the main data structures, methods, and functions available in the Python Pandas module.