Pandas ExcelFile.Parse - All you need to know

When it comes to data analysis, the primary task is to import data into the tool which analyses the data. The tricky part lies in getting the data imported since there is no universal format to consolidate data. Some prefer spreadsheets while others make do with the ‘.csv’ files. Some might even import data directly from websites, while others would like it in the form of ‘.txt’ files.

With the umpteen ways to import data, a data analytic tool should be compatible with the different formats of data repositories available. This article focuses on one such repository which is ubiquitous in recording data – MS Excel. We shall import the details of an MS Excel file using the ExcelFile.Parse( ) function from the pandas library.

So, let us start things off by importing the pandas library using the below code.

import pandas as pd

Thereafter we shall dive into the greater depths of the ExcelFile.Parse( ) function through each of the following sections.

Syntax of the ExcelFile.Parse( ) function
ExcelFile.Parse( ) function explained with an example

Syntax of the ExcelFile.Parse( ) function

Following is the syntax containing the mandatory and optional constructs that are required for the proper functioning of the ExcelFile.Parse( ) function.

pandas.ExcelFile.Parse(sheet_name=0, header=0, names=None, 
index_col=None, usecols=None, sqeeze=None, converters=None,
true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, parse_dates=False,
date_parser=None, thousands=None, comment=None,
skipfooter=0, convert_float=None, mangle_dupe_col=True, **kwds)

where,

sheet_name – Name or page number of the sheet within the MS Excel file containing the data to be imported.
header – Set to zero (0) by default, it is used to specify the row that contains the labels of the data that is to be parsed.
names – Set to ‘None’ by default, it is used to specify the column names that are to be used.
index_col – Set to ‘None’ by default, it is used to specify the column that is to be used as row labels for the parsed data.
usecols – Used to return a subset of the input columns based on whether it is been set to a string or a list of strings/numbers. Being set to ‘None’ by default, it also has the provision to be set as callable.
squeeze – Set to ‘None’ by default, it is used to return a series when the parsed data contains only one column but has been deprecated in the versions following 1.4.0.
converters – Set to ‘None’ by default, it is used to specify the dict of functions that can covert the values in specific columns.
true_values – Set to ‘None’ by default, it is used to specify the values that are to be considered ‘True’ when parsing the data.
false_values – Set to ‘None’ by default, it is used to specify the values that are to be considered ‘False’ when parsing the data.
skiprows – Set to ‘None’ by default, it is used to specify the line numbers or the number of lines to be skipped from the start of the file.
nrows – Set to ‘None’ by default, it is used to specify the number of rows to be parsed.
na_values – Set to ‘None’ by default, it is used to specify the strings that are to be considered as NA values such as ‘#N/A’, ‘#NA’, ‘NaN’, ‘na’ & so on.
parse_dates – Set to ‘False’ by default, it is used to specify the list of dates within the input data that are to be parsed.
date_parser – Set to ‘None’ by default, it is used to convert columns containing strings into an array of ‘datetime’ format.
thousands – Set to ‘None’ by default, it is used to include a thousand separator for the column values stored in ‘Text’ format in the input data.
comment – Set to ‘None’ by default, it is used to highlight the locations of comments in the input MS Excel file by passing a character(s).
skipfooter – Set to zero (0) by default, it is used to skip the rows at the end of the input data.
convert_float – Set to ‘None’ by default, it is used to convert data stored as ‘float’ into ‘int’, but is deprecated in versions following 1.3.0.
mangle_dupe_col – Set to ‘True’ by default, it is used to specify duplicate columns as ‘X’, ‘X.1’, …. ‘X.N’ & is used to overwrite the data in the duplicate columns if set to ‘False’. It is deprecated in versions following 1.5.0.
**kwds – kwargs or keyword argument which is an optional construct used to pass keyword variable length of argument to a function.

ExcelFile.Parse( ) function explained with an example

Let us first specify the location of the file containing the data that is to be parsed within the ExcelFile( ) function as given below.

XL = pd.ExcelFile('D:\Ask Python\Examples.xlsx')

Once done, it is time to parse the data within this file, specifically the data of interest are those that fall within the first three columns of the first sheet. This can be done using the following code.

df = XL.parse('Sheet1', usecols = [1, 2, 3])
print (df)

Printing the dataframe yields the below result.

Now let’s say we would like the very first column of the output, those with ‘0, 1, 2 & 3’ to be replaced with the contents of those in the second column then, the same can be done as follows.

df = XL.parse('Sheet1', usecols = [1, 2, 3], index_col = 0)
print (df)

Suppose only the results of the first three rows are only needed, one can use the nrows option to make it happen.

df = XL.parse('Sheet1', usecols = [1, 2, 3], index_col = 0, nrows = 3)
print (df)

Conclusion

Now that we have reached the end of this article, hope it has elaborated on how to use the ExcelFile.Parse( ) function from the pandas library. Here’s another article that details the usage of the convolve( ) function from the numpy library in Python. There are numerous other enjoyable and equally informative articles in AskPython that might be of great help to those who are looking to level up in Python. Audere est facere!