Two Pandas DataFrame or Series objects can be joined together in a database-style operation using the Pandas merge()
function. When merging items based on one or more keys, it is utilized.
You will learn about the Pandas merge() method in this lesson as well as how to use its various available parameters to put it into practice.
Recommended Read: Pandas concat(): Concatenate Pandas objects along a particular axis
Prerequisites of Pandas merge()
You need to have Python and Pandas installed on your computer and your favorite IDE set up to start coding. If you don’t have Pandas installed, you can install it using the command:
pip install pandas
If you are using the Anaconda distribution use the command:
conda install pandas
Syntax of Pandas merge()
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Parameter | Description |
left | DataFrame. |
right | Object to merge left DataFrame with. |
how | The type of join to be performed, allows the values: ‘left’, ‘right’, ‘inner’, ‘outer’, and ‘cross’. The default value is ‘inner’. |
on | Column or index level names to join on. These columns must be present in both the DataFrames. |
left_on | Column or index level names to join on in the left DataFrame. |
right_on | Column or index level names to join on in the right DataFrame. |
left_index | Use the index from the left DataFrame as the join key(s). |
right_index | Use the index from the right DataFrame as the join key. |
sort | Sort the join keys lexicographically in the result DataFrame. |
copy | If False, avoid copy if possible. |
indicator | If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. |
validate | If specified, check if the merge is of the specified type. |
Returns:
A resulting DataFrame after performing the merge operation.
Examples of using the Pandas merge() function
Let me now show you how to utilize this function with the help of some examples.
Performing the default merge operation
The below example demonstrates the default merge operation by first creating two data frames and then applying the merge() method to them.
import pandas as pd
data1 = {
'ID': [1, 2, 3],
'Name': ['Harry', 'Smith', 'Michelle']
}
df1 = pd.DataFrame(data1)
print('DataFrame 1:\n', df1)
data2 = {
'Designation': ['Software Developer', 'Teacher', 'Journalist'],
'Name': ['Alice', 'Smith', 'Michelle']
}
df2 = pd.DataFrame(data2)
print('DataFrame 2:\n', df2)
# default join = 'inner'
df_merged = pd.merge(df1, df2)
print('DataFrame merged:\n', df_merged)
Output:
DataFrame 1:
ID Name
0 1 Harry
1 2 Smith
2 3 Michelle
DataFrame 2:
Designation Name
0 Software Developer Alice
1 Teacher Smith
2 Journalist Michelle
DataFrame merged:
ID Name Designation
0 2 Smith Teacher
1 3 Michelle Journalist
Since the default join is an inner join, only the records that are common in both the DataFrames are returned.
Pandas merge() using Left, Right, and Outer Join
The different types of joins return different records as described below:
- Left join: All records from the left table and only matching records from the right table
- Right join: All records from the right table and only matching records from the left table
- Outer join: All records from both left and right tables whether or not there is a match
You can use these join types to merge the DataFrames as demonstrated below:
print('Left Join Output:\n', pd.merge(df1, df2, how='left'))
print('\nRight Join Output:\n', pd.merge(df1, df2, how='right'))
print('\nOuter Join Output:\n', pd.merge(df1, df2, how='outer'))
Output:
Left Join Output:
ID Name Designation
0 1 Harry NaN
1 2 Smith Teacher
2 3 Michelle Journalist
Right Join Output:
ID Name Designation
0 NaN Alice Software Developer
1 2.0 Smith Teacher
2 3.0 Michelle Journalist
Outer Join Output:
ID Name Designation
0 1.0 Harry NaN
1 2.0 Smith Teacher
2 3.0 Michelle Journalist
3 NaN Alice Software Developer
Merge Two DataFrames on a Column
You can specify the column on which you want to join the data objects using the on parameter.
The below example shows the merging of two DataFrames on the columns ID and Name.
import pandas as pd
data1 = {
'ID': [1, 2, 3],
'Name': ['Harry', 'Smith', 'Michelle']
}
df1 = pd.DataFrame(data1)
print('DataFrame 1:\n', df1)
data2 = {
'ID': [1, 2, 3],
'Designation': ['Software Developer', 'Teacher', 'Journalist'],
'Name': ['Alice', 'Smith', 'Michelle']
}
df2 = pd.DataFrame(data2)
print('DataFrame 2:\n', df2)
# merge on ID
print('Merged on ID:\n', pd.merge(df1, df2, on='ID'))
# merge on Name
print('Merged on Name:\n', pd.merge(df1, df2, on='Name'))
Output:
DataFrame 1:
ID Name
0 1 Harry
1 2 Smith
2 3 Michelle
DataFrame 2:
ID Designation Name
0 1 Software Developer Alice
1 2 Teacher Smith
2 3 Journalist Michelle
Merged on ID:
ID Name_x Designation Name_y
0 1 Harry Software Developer Alice
1 2 Smith Teacher Smith
2 3 Michelle Journalist Michelle
Merged on Name:
ID_x Name ID_y Designation
0 2 Smith 2 Teacher
1 3 Michelle 3 Journalist
- Merge on ID: The ID column is used as a key for performing the merge operation. Since the values of Name column differ for the same value of ID, the Name column is presented in two parts Name_x and Name_y.
- Merge on Name: Here, the Name column is used for performing the merge operation. Only the values in the Name column that are present in both the DataFrames are included in the output. Since ID column is present in both the DataFrames, it is presented in two parts, one for each DataFrame.
Sort the result of the merge operation
You can use the sort parameter to specify whether you want to sort the output lexicographically based on the join keys or not.
import pandas as pd
data1 = {
'ID': [1, 2, 3],
'Name': ['Harry', 'Smith', 'Michelle']
}
df1 = pd.DataFrame(data1)
print('DataFrame 1:\n', df1)
data2 = {
'ID': [1, 2, 3],
'Designation': ['Software Developer', 'Teacher', 'Journalist'],
'Name': ['Alice', 'Smith', 'Michelle']
}
df2 = pd.DataFrame(data2)
print('DataFrame 2:\n', df2)
# merge the dataframes
print('Merged DataFrame:\n', pd.merge(df1, df2, on='Name', sort=True))
Output:
DataFrame 1:
ID Name
0 1 Harry
1 2 Smith
2 3 Michelle
DataFrame 2:
ID Designation Name
0 1 Software Developer Alice
1 2 Teacher Smith
2 3 Journalist Michelle
Merged DataFrame:
ID_x Name ID_y Designation
0 3 Michelle 3 Journalist
1 2 Smith 2 Teacher
As the code mentions merging the DataFrames on the column Name and sorting it, the output is a Pandas DataFrame that contains records in the lexicographically sorted order of the values in the Name column. Hence, the record for Name=’Michelle’ comes before the record for Name=’Smith’.
Conclusion
In this tutorial, you have learned about the Pandas merge()
function. After reading this tutorial, you understood the syntax of the merge()
function and also how to use it along with the different parameters supported by it.
Please visit askpython.com for more such easy-to-understand tutorials on Python.