Pandas merge(): Merging DataFrame or Series objects with a join

Pandas Merge Cover Image

Two Pandas DataFrame or Series objects can be joined together in a database-style operation using the Pandas merge() function. When merging items based on one or more keys, it is utilized.

You will learn about the Pandas merge() method in this lesson as well as how to use its various available parameters to put it into practice.

Recommended Read: Pandas concat(): Concatenate Pandas objects along a particular axis


Prerequisites of Pandas merge()

You need to have Python and Pandas installed on your computer and your favorite IDE set up to start coding. If you don’t have Pandas installed, you can install it using the command:

pip install pandas

If you are using the Anaconda distribution use the command:

conda install pandas

Syntax of Pandas merge()

pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
ParameterDescription
leftDataFrame.
rightObject to merge left DataFrame with.
howThe type of join to be performed, allows the values: ‘left’, ‘right’, ‘inner’, ‘outer’, and ‘cross’. The default value is ‘inner’.
onColumn or index level names to join on. These columns must be present in both the DataFrames.
left_onColumn or index level names to join on in the left DataFrame.
right_onColumn or index level names to join on in the right DataFrame.
left_indexUse the index from the left DataFrame as the join key(s).
right_indexUse the index from the right DataFrame as the join key.
sortSort the join keys lexicographically in the result DataFrame. 
copyIf False, avoid copy if possible.
indicatorIf True, adds a column to the output DataFrame called “_merge” with information on the source of each row.
validateIf specified, check if the merge is of the specified type.

Returns:

A resulting DataFrame after performing the merge operation.


Examples of using the Pandas merge() function

Let me now show you how to utilize this function with the help of some examples.

Performing the default merge operation

The below example demonstrates the default merge operation by first creating two data frames and then applying the merge() method to them.

import pandas as pd

data1 = {
    'ID': [1, 2, 3],
    'Name': ['Harry', 'Smith', 'Michelle']
}

df1 = pd.DataFrame(data1)
print('DataFrame 1:\n', df1)

data2 = {
    'Designation': ['Software Developer', 'Teacher', 'Journalist'],
    'Name': ['Alice', 'Smith', 'Michelle']
}

df2 = pd.DataFrame(data2)
print('DataFrame 2:\n', df2)

# default join = 'inner'
df_merged = pd.merge(df1, df2)
print('DataFrame merged:\n', df_merged)

Output:

DataFrame 1:
    ID      Name
0   1     Harry
1   2     Smith
2   3  Michelle
DataFrame 2:
           Designation      Name
0  Software Developer     Alice
1             Teacher     Smith
2          Journalist  Michelle
DataFrame merged:
    ID      Name Designation
0   2     Smith     Teacher
1   3  Michelle  Journalist

Since the default join is an inner join, only the records that are common in both the DataFrames are returned.


Pandas merge() using Left, Right, and Outer Join

The different types of joins return different records as described below:

  • Left join: All records from the left table and only matching records from the right table
  • Right join: All records from the right table and only matching records from the left table
  • Outer join: All records from both left and right tables whether or not there is a match

You can use these join types to merge the DataFrames as demonstrated below:

print('Left Join Output:\n', pd.merge(df1, df2, how='left'))
print('\nRight Join Output:\n', pd.merge(df1, df2, how='right'))
print('\nOuter Join Output:\n', pd.merge(df1, df2, how='outer'))

Output:

Left Join Output:
    ID      Name Designation
0   1     Harry         NaN
1   2     Smith     Teacher
2   3  Michelle  Journalist

Right Join Output:
     ID      Name         Designation
0  NaN     Alice  Software Developer
1  2.0     Smith             Teacher
2  3.0  Michelle          Journalist

Outer Join Output:
     ID      Name         Designation
0  1.0     Harry                 NaN
1  2.0     Smith             Teacher
2  3.0  Michelle          Journalist
3  NaN     Alice  Software Developer

Merge Two DataFrames on a Column

You can specify the column on which you want to join the data objects using the on parameter.

The below example shows the merging of two DataFrames on the columns ID and Name.

import pandas as pd

data1 = {
    'ID': [1, 2, 3],
    'Name': ['Harry', 'Smith', 'Michelle']
}

df1 = pd.DataFrame(data1)
print('DataFrame 1:\n', df1)

data2 = {
    'ID': [1, 2, 3],
    'Designation': ['Software Developer', 'Teacher', 'Journalist'],
    'Name': ['Alice', 'Smith', 'Michelle']
}

df2 = pd.DataFrame(data2)
print('DataFrame 2:\n', df2)

# merge on ID
print('Merged on ID:\n', pd.merge(df1, df2, on='ID'))
# merge on Name
print('Merged on Name:\n', pd.merge(df1, df2, on='Name'))

Output:

DataFrame 1:
    ID      Name
0   1     Harry
1   2     Smith
2   3  Michelle
DataFrame 2:
    ID         Designation      Name
0   1  Software Developer     Alice
1   2             Teacher     Smith
2   3          Journalist  Michelle
Merged on ID:
    ID    Name_x         Designation    Name_y
0   1     Harry  Software Developer     Alice
1   2     Smith             Teacher     Smith
2   3  Michelle          Journalist  Michelle
Merged on Name:
    ID_x      Name  ID_y Designation
0     2     Smith     2     Teacher
1     3  Michelle     3  Journalist
  • Merge on ID: The ID column is used as a key for performing the merge operation. Since the values of Name column differ for the same value of ID, the Name column is presented in two parts Name_x and Name_y.
  • Merge on Name: Here, the Name column is used for performing the merge operation. Only the values in the Name column that are present in both the DataFrames are included in the output. Since ID column is present in both the DataFrames, it is presented in two parts, one for each DataFrame.

Sort the result of the merge operation

You can use the sort parameter to specify whether you want to sort the output lexicographically based on the join keys or not.

import pandas as pd

data1 = {
    'ID': [1, 2, 3],
    'Name': ['Harry', 'Smith', 'Michelle']
}

df1 = pd.DataFrame(data1)
print('DataFrame 1:\n', df1)

data2 = {
    'ID': [1, 2, 3],
    'Designation': ['Software Developer', 'Teacher', 'Journalist'],
    'Name': ['Alice', 'Smith', 'Michelle']
}

df2 = pd.DataFrame(data2)
print('DataFrame 2:\n', df2)

# merge the dataframes
print('Merged DataFrame:\n', pd.merge(df1, df2, on='Name', sort=True))

Output:

DataFrame 1:
    ID      Name
0   1     Harry
1   2     Smith
2   3  Michelle
DataFrame 2:
    ID         Designation      Name
0   1  Software Developer     Alice
1   2             Teacher     Smith
2   3          Journalist  Michelle
Merged DataFrame:
    ID_x      Name  ID_y Designation
0     3  Michelle     3  Journalist
1     2     Smith     2     Teacher

As the code mentions merging the DataFrames on the column Name and sorting it, the output is a Pandas DataFrame that contains records in the lexicographically sorted order of the values in the Name column. Hence, the record for Name=’Michelle’ comes before the record for Name=’Smith’.


Conclusion

In this tutorial, you have learned about the Pandas merge() function. After reading this tutorial, you understood the syntax of the merge() function and also how to use it along with the different parameters supported by it.

Please visit askpython.com for more such easy-to-understand tutorials on Python.


Reference