How to Read Excel Files in Python using Pandas (read_excel)

Excel files are everywhere in the professional world. Your manager sends you quarterly sales data as an .xlsx file. Your finance team exports trial balances from accounting software. The dataset you need for a machine learning project lives in a spreadsheet someone emailed you last week. If you are working with data in Python, at some point you will need to read an Excel file, and the standard way to do that is with pandas.read_excel().
In this tutorial, I will walk you through every important argument of pandas.read_excel(). You will learn how to read entire sheets, specific columns, date columns, multiple sheets from a single file, and even password-protected workbooks. Each section has runnable code with real output so you can see exactly what to expect.
Prerequisites
You need two packages: pandas and openpyxl. Pandas handles the data manipulation. Openpyxl is the engine that actually reads the .xlsx file format. Install both with pip.
pip install pandas openpyxl
The openpyxl package is important. Without it, pandas will raise an error when you try to read .xlsx files. If you are reading older .xls files, you need the xlrd package instead.
pip install xlrd
A Minimal Example First
Let me show you the simplest possible call. Suppose you have a file called sales.xlsx in your current directory. Reading it takes one line.
import pandas as pd
df = pd.read_excel("sales.xlsx")
print(df.head())
That is it. Pandas opens the file, reads the first sheet, treats the first row as column headers, and returns a DataFrame. Now let us build from here.
Specifying the Sheet
By default, read_excel reads the first sheet. An Excel file can contain multiple sheets, and you can target the one you want in three ways.
By Sheet Name
df = pd.read_excel("sales.xlsx", sheet_name="Q4 Data")
By Sheet Index
Sheet indices start at zero. Reading the third sheet in the file looks like this.
df = pd.read_excel("sales.xlsx", sheet_name=2)
Reading All Sheets at Once
Pass sheet_name=None to get a dictionary where each key is a sheet name and each value is the DataFrame for that sheet.
all_sheets = pd.read_excel("sales.xlsx", sheet_name=None)
for sheet_name, sheet_df in all_sheets.items():
    print(f"Sheet: {sheet_name}, Rows: {len(sheet_df)}")
This approach is useful when you need to process every sheet in a workbook without manually listing them.
Selecting Columns
You rarely need every column from a spreadsheet. Loading only what you want saves memory and speeds up your code.
Load Only Specific Columns with usecols
The usecols parameter accepts column names, column indices, or a callable function.
# By column names
df = pd.read_excel("sales.xlsx", usecols=["Date", "Revenue", "Region"])
# By column indices (0, 2, 3 correspond to Excel columns A, C, D)
df = pd.read_excel("sales.xlsx", usecols=[0, 2, 3])
# Using a string range like Excel uses
df = pd.read_excel("sales.xlsx", usecols="B:D")
The string range notation “B:D” is handy when you want columns B through D without counting indices.
Skipping Rows and Loading a Header
Sometimes the actual column headers in your Excel file are not on row 1. The data might start on row 3 or 4, with metadata or blank rows above it.
# Skip the first 2 rows, use row 3 as column headers
df = pd.read_excel("sales.xlsx", header=2)
# Use a specific row as the header
df = pd.read_excel("sales.xlsx", header=0) # same as default
# No header at all, generate column names manually
df = pd.read_excel("sales.xlsx", header=None)
df.columns = ["Date", "Product", "Revenue", "Region"]
Handling Dates
Date columns in Excel files are a common source of confusion. Excel stores dates as serial numbers. Pandas can convert those serial numbers to proper Python datetime objects automatically.
# Parse the 'Date' column as datetime
df = pd.read_excel("sales.xlsx", parse_dates=["Date"])
Note that parse_dates=True does not scan every column for dates; in read_excel it only applies to the index column. List the columns you want converted explicitly, as shown above.
Check the result. You will see datetime64[ns] as the dtype instead of generic objects.
print(df.dtypes)
# Date       datetime64[ns]
# Revenue           float64
# Region             object
Sometimes the date format in the Excel file is non-standard. The date_parser argument is deprecated as of pandas 2.0; pass a format string with date_format instead, or convert the column after loading with pd.to_datetime.
# pandas 2.0+: pass the format string directly
df = pd.read_excel("sales.xlsx", parse_dates=["Date"], date_format="%d/%m/%Y")
# Or convert after loading
df = pd.read_excel("sales.xlsx")
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
Skipping Rows and Footer Rows
Some Excel exports have summary rows at the bottom, notes, or blank filler rows. You can skip rows from the top and the bottom independently.
# Skip first 5 rows, do not read the last 3 rows
df = pd.read_excel("sales.xlsx", skiprows=5, skipfooter=3)
Both parameters work together. You might use skiprows=5 to jump past a title block, then skipfooter=2 to drop a total row that would otherwise corrupt your numeric columns.
Handling Missing Values and Data Types
Excel files often have empty cells. Pandas reads these as NaN by default, which is what you want. You can control which values are treated as missing.
# Treat 'N/A' and 'NA' as NaN
df = pd.read_excel("sales.xlsx", na_values=["N/A", "NA"])
# Keep 'N/A' as a string, only treat blanks as NaN
df = pd.read_excel("sales.xlsx", keep_default_na=False)
The dtype argument lets you specify data types for specific columns upfront rather than fixing them after loading.
df = pd.read_excel("sales.xlsx", dtype={
    "Revenue": float,
    "Quantity": int,
    "Region": str
})
Setting types at read time is faster than converting them afterward because pandas does not have to infer and then re-convert.
Working with Large Files
Reading a massive Excel file into memory all at once is not always practical. Here are two strategies for large files.
Chunked Reading
Pandas does not natively support chunked reading for Excel files the way it does for CSVs. However, you can simulate it by reading a specific range of rows using the nrows parameter and looping.
chunk_size = 10000
skip_rows = 0
while True:
    df_chunk = pd.read_excel(
        "large_sales.xlsx",
        skiprows=skip_rows,
        nrows=chunk_size
    )
    if df_chunk.empty:
        break
    # Process this chunk
    print(f"Processing rows {skip_rows} to {skip_rows + len(df_chunk)}")
    skip_rows += chunk_size
You need to handle the header row carefully in each iteration. The cleanest way is to read the header once, then read data chunks while adjusting the skiprows offset.
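The header-once pattern can be sketched like this. This is a runnable sketch that builds a small generated workbook ("large_demo.xlsx") as a stand-in for a real large file; swap in your own path and chunk size.

```python
import pandas as pd

# Build a small demo workbook so the sketch is runnable;
# "large_demo.xlsx" is a stand-in for your real large file.
pd.DataFrame({"Date": pd.date_range("2024-01-01", periods=25),
              "Revenue": range(25)}).to_excel("large_demo.xlsx", index=False)

chunk_size = 10

# Read the header row once to capture the column names
columns = pd.read_excel("large_demo.xlsx", nrows=0).columns.tolist()

chunks = []
skip = 1  # row 0 is the header, so data starts at row 1
while True:
    chunk = pd.read_excel("large_demo.xlsx", header=None, names=columns,
                          skiprows=skip, nrows=chunk_size)
    if chunk.empty:
        break
    chunks.append(chunk)
    skip += chunk_size

total_rows = sum(len(c) for c in chunks)
print(total_rows)  # 25
```

Because each chunk is read with header=None and the captured column names, no data row is ever mistaken for a header.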
Converting Excel to CSV First
If you have openpyxl installed, you can use it to convert an Excel file to CSV format first, then read the CSV with pandas chunking. This is often faster for files larger than 50 MB.
import csv
import openpyxl

def excel_to_csv(excel_path, csv_path, sheet_name=0):
    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
    sheet = wb.active if sheet_name == 0 else wb[sheet_name]
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)  # handles quoting and None cells correctly
        for row in sheet.iter_rows(values_only=True):
            writer.writerow(row)
    wb.close()
    print(f"Converted {excel_path} to {csv_path}")

excel_to_csv("large_sales.xlsx", "large_sales.csv")
Reading Password-Protected Files
Sometimes Excel files are password-protected. Despite what many snippets online suggest, pandas.read_excel() has no password argument, and openpyxl cannot open encrypted workbooks either. If you know the password, the standard approach is the msoffcrypto-tool package, which decrypts the workbook into an in-memory buffer that pandas can then read.
pip install msoffcrypto-tool
import io
import msoffcrypto
import pandas as pd

decrypted = io.BytesIO()
with open("protected_sales.xlsx", "rb") as f:
    office_file = msoffcrypto.OfficeFile(f)
    office_file.load_key(password="MySecurePassword")
    office_file.decrypt(decrypted)

df = pd.read_excel(decrypted)
Handle the password carefully in production code. Do not hardcode it. Use environment variables or a secrets manager instead.
import os
password = os.environ.get("EXCEL_PASSWORD")
# ...then pass it to load_key(password=password) as above
Reading Multiple Excel Files at Once
If you have a folder full of Excel files from different months or different regions, you can read and concatenate them in one pipeline.
import glob
import os
excel_files = glob.glob("data/*.xlsx")
dataframes = []
for file in excel_files:
    df = pd.read_excel(file, sheet_name=0)
    df["source_file"] = os.path.basename(file)
    dataframes.append(df)
combined_df = pd.concat(dataframes, ignore_index=True)
print(f"Combined {len(excel_files)} files: {len(combined_df)} total rows")
The source_file column lets you trace each row back to its origin file, which is useful when you need to debug data quality issues.
Handling Multiple Sheets from Multiple Files
When each Excel file has multiple sheets, you need to iterate through both the files and the sheets.
import glob
import os

excel_files = glob.glob("reports/*.xlsx")
all_data = []
for file in excel_files:
    sheets = pd.read_excel(file, sheet_name=None)
    for sheet_name, df in sheets.items():
        df["file"] = os.path.basename(file)
        df["sheet"] = sheet_name
        all_data.append(df)
result = pd.concat(all_data, ignore_index=True)
This pattern handles monthly report files where each file contains sheets for different departments.
Reading Excel with an Index Column
You can designate a column from the Excel file as the DataFrame index at load time.
# Use the first column as the index
df = pd.read_excel("sales.xlsx", index_col=0)
# Use a specific column name as the index
df = pd.read_excel("sales.xlsx", index_col="RecordID")
Setting the index at load time is cleaner than calling df.set_index() afterward: you skip an extra step and the intermediate DataFrame it would create.
Dealing with Merged Cells
Merged cells are common in Excel reports. When pandas reads merged cells, only the first cell contains the value. The rest show as NaN.
# Fill merged cell values downward
df = pd.read_excel("report.xlsx")
df = df.ffill()
The forward fill method propagates the merged cell value down to the cells below it, which is usually what you want for labelled or categorical data in merged cells.
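Here is a self-contained sketch of that behavior. It uses openpyxl to build a tiny workbook ("merged.xlsx", an illustrative name) with a vertically merged Category cell, then shows the NaN that read_excel produces and the forward fill that repairs it.

```python
import openpyxl
import pandas as pd

# Build a sheet where "Hardware" spans two rows via a merged cell
wb = openpyxl.Workbook()
ws = wb.active
ws.append(["Category", "Revenue"])
ws.append(["Hardware", 100])
ws.append([None, 200])
ws.merge_cells("A2:A3")
wb.save("merged.xlsx")

df = pd.read_excel("merged.xlsx")
# Only the first cell of the merged range carries the value
print(df["Category"].tolist())  # ['Hardware', nan]

df["Category"] = df["Category"].ffill()
print(df["Category"].tolist())  # ['Hardware', 'Hardware']
```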
Reading Only a Subset of Rows
You can limit how many rows are read using nrows.
# Read only the first 100 rows
df = pd.read_excel("sales.xlsx", nrows=100)
This is useful for quickly inspecting the structure of a large file without loading the whole thing into memory.
Real-World Example: Reading a Sales Report
Let me put everything together with a realistic example. You have a quarterly sales report with the following structure.
- Sheet named “Q4 Sales”
- Row 1 contains company branding, row 2 is blank, row 3 has the actual column headers
- Columns are Date, Product, Category, Units Sold, Unit Price, Total Revenue
- Some cells in the Category column are blank due to merged cells
- Two summary rows at the bottom need to be skipped
import pandas as pd
# Read the sheet with correct header row
df = pd.read_excel(
    "q4_sales.xlsx",
    sheet_name="Q4 Sales",
    header=2,              # Row 3 has the column headers
    skipfooter=2,          # Skip the two summary rows at the bottom
    parse_dates=["Date"],  # Parse the Date column as datetime
    usecols=["Date", "Product", "Category", "Units Sold", "Unit Price", "Total Revenue"],
    dtype={
        "Units Sold": int,
        "Unit Price": float,
        "Total Revenue": float
    },
    na_values=["", "N/A"]
)
# Fill merged cells in Category column
df["Category"] = df["Category"].ffill()
print(df.shape)
print(df.info())
print(df.describe())
The output shows you how many rows and columns were loaded, the data types, and summary statistics for the numeric columns.
Common Errors and How to Fix Them
Here are the most frequent problems you will encounter when reading Excel files with pandas.
Missing xlrd or openpyxl
# ValueError: Excel file format cannot be determined
# Fix: install the required engine
pip install openpyxl
Sheet Does Not Exist
# Check available sheets first
xl = pd.ExcelFile("sales.xlsx")
print(xl.sheet_names)
# Then use the correct name
df = pd.read_excel("sales.xlsx", sheet_name="Actual Sheet Name")
Wrong Header Row
If your DataFrame has strange column names like “Unnamed: 0”, you are probably using the wrong header row. Adjust the header parameter.
# Try different values to find the right row
for i in range(5):
    df_test = pd.read_excel("sales.xlsx", header=i, nrows=3)
    print(f"header={i}: {list(df_test.columns)}")
Memory Error on Large Files
If you run out of memory, switch to reading specific columns only or convert to CSV first.
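A minimal sketch of the column-and-row trimming approach, using a generated file ("wide.xlsx", an illustrative name) in place of a real memory-heavy workbook:

```python
import pandas as pd

# Generated stand-in for a large, wide export
pd.DataFrame({c: range(1000) for c in ["A", "B", "C", "D"]}).to_excel(
    "wide.xlsx", index=False
)

# Load only two columns and the first 100 rows to cap memory use
df = pd.read_excel("wide.xlsx", usecols=["A", "B"], nrows=100)
print(df.shape)  # (100, 2)
```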
Performance Comparison: read_excel vs Alternatives
The read_excel function is convenient but not always the fastest option. Here is a rough comparison for a 100 MB Excel file; exact numbers depend on your hardware and the file's contents.
| Method | Time | Memory | Flexibility |
|---|---|---|---|
| pandas read_excel | ~30 sec | High | Full featured |
| openpyxl read_only + CSV convert | ~20 sec | Low | Manual |
| pyxlsb (binary format) | ~15 sec | Medium | Limited |
For routine ETL jobs that run on a schedule, converting to CSV or Parquet first and then reading with pd.read_csv or pd.read_parquet in subsequent runs is significantly faster.
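A sketch of that one-off conversion, shown here with CSV. The file names are illustrative, and a small generated workbook stands in for a real export; the idea is to pay the Excel parsing cost once, then let every later run read the faster text copy.

```python
import pandas as pd

# Generated stand-in for a real Excel export
pd.DataFrame({"Region": ["North", "South"],
              "Revenue": [100.0, 200.0]}).to_excel("sales_demo.xlsx", index=False)

# One-off conversion: parse the Excel file once, save a text copy
df = pd.read_excel("sales_demo.xlsx")
df.to_csv("sales_demo.csv", index=False)

# Subsequent runs read the CSV, which parses much faster
df_fast = pd.read_csv("sales_demo.csv")
print(df_fast.equals(df))  # True
```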
Bonus: Reading Google Sheets
If your data actually lives in a Google Sheet, you can read it with pandas either via the sheet's CSV export URL or with the gspread library.
# Method 1: Export as CSV URL
csv_url = "https://docs.google.com/spreadsheets/d/YOUR_ID/export?format=csv"
df = pd.read_csv(csv_url)
# Method 2: Using gspread
import gspread
gc = gspread.service_account("credentials.json")
sh = gc.open("Your Sheet Name")
df = pd.DataFrame(sh.sheet1.get_all_records())
Summary
The pandas.read_excel() function is the workhorse for reading spreadsheet data in Python. You learned how to specify sheets by name or index, select only the columns you need, handle dates correctly, skip unwanted rows, deal with merged cells, process large files in chunks, and work with password-protected workbooks. The function has many arguments, and the key to mastering it is knowing which arguments apply to your specific file format. Keep this guide as a reference and tweak the parameters to match your actual data.
Frequently Asked Questions
What is the difference between read_excel and read_csv in pandas?
CSV files are plain text with comma-separated values. Excel files are binary spreadsheets that can contain multiple sheets, formatted cells, formulas, images, and more. Use read_csv for raw tabular data exports. Use read_excel when you need multi-sheet workbooks, formula results, or formatted data from spreadsheet tools.
How do I read only specific sheets from an Excel file?
Pass the sheet_name parameter with either the sheet name string or a zero-based sheet index. To read multiple sheets, pass a list of names or indices.
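A short runnable sketch of the multi-sheet case, using a generated workbook ("demo.xlsx", an illustrative name) with three sheets:

```python
import pandas as pd

# Build a demo workbook with three sheets
with pd.ExcelWriter("demo.xlsx") as writer:
    for name in ["Q1", "Q2", "Q3"]:
        pd.DataFrame({"Revenue": [1, 2]}).to_excel(
            writer, sheet_name=name, index=False
        )

# A list of sheet names (or indices) returns a dict keyed by those names
subset = pd.read_excel("demo.xlsx", sheet_name=["Q1", "Q3"])
print(sorted(subset.keys()))  # ['Q1', 'Q3']
```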
Why is pandas reading my dates as numbers?
Excel stores dates as serial numbers. Use parse_dates=["ColumnName"] to have pandas convert specific columns to datetime objects automatically; parse_dates=True applies only to the index column.
How do I read a password-protected Excel file?
pandas cannot open encrypted workbooks directly. Decrypt the file first with the msoffcrypto-tool package into an in-memory buffer, then pass that buffer to read_excel. Keep the password in an environment variable rather than hardcoding it in your script.
Can pandas read .xls files?
Yes. Install xlrd with pip install xlrd and pandas will read .xls files automatically using that engine.
How do I read a large Excel file without running out of memory?
Read specific columns with usecols, limit rows with nrows, or convert the file to CSV first using openpyxl in read-only mode, then process the CSV in chunks.


