How to combine CSV files using Python?

Combine Csv Cover Image

Often while working with CSV files, we need to deal with large datasets. Depending on the requirements of the data analysis, we may find that all the required data is not present in a single CSV file. Then the need arises to merge multiple files to get the desired data. However, copy-pasting the required columns from one file to another and that too from large datasets is not the best way to around it.

To solve this problem, we will learn how to use the append, merge and concat methods from Pandas to combine CSV files.


Combining Multiple CSV Files together

To begin with, let’s create sample CSV files that we will be using.

Csv File 1
CSV File 1
Csv File 2
CSV File 2
Csv File 3
CSV File 3

Notice that, all three files have the same columns or headers i.e. ‘name’, ‘age’ and ‘score’. Also, file 1 and file 3 have a common entry for the ‘name’ column which is Sam, but the rest of the values are different in these files.

Note that, in the below examples we are considering that all the CSV files are in the same folder as your Python code file. If this is not the case for you, please specify the paths accordingly while trying out the examples by yourself.
All the examples were executed in a Jupyter notebook.


Different Ways to Combine CSV Files in Python

Before starting, we will be creating a list of the CSV files that will be used in the examples below as follows:

import glob

# list all csv files only
csv_files = glob.glob('*.{}'.format('csv'))
csv_files

Output:

['csv_file_1.csv', 'csv_file_2.csv', 'csv_file_3.csv']

Method 1: append()

Let’s look at the append method here to merge the three CSV files.

import pandas as pd

df_csv_append = pd.DataFrame()

# append the CSV files
for file in csv_files:
    df = pd.read_csv(file)
    df_csv_append = df_csv_append.append(df, ignore_index=True)

df_csv_append

Output:

Output Append And Concat

The append method, as the name suggests, appends each file’s data frame to the end of the previous one. In the above code, we first create a data frame to store the result named df_csv_append. Then, we iterate through the list and read each CSV file and append it to the data frame df_csv_append.


Method 2: concat()

Another method used to combine CSV files is the Pandas concat() method. This method requires a series of objects as a parameter, hence we first create a series of the data frame objects of each CSV file and then apply the concat() method to it.

import pandas as pd

df_csv_concat = pd.concat([pd.read_csv(file) for file in csv_files ], ignore_index=True)
df_csv_concat

An easier-to-understand way of writing this code is:

l = []

for f in csv_files:
    l.append(pd.read_csv(f))
    
df_res = pd.concat(l, ignore_index=True)
df_res

Both the above codes when executed produce the same output as shown below.

Output Append And Concat 1

Notice that the resulting data frame is the same as that of the append() method.


Method 3: merge()

The merge method is used to join very large data frames. A join can be performed on two data frames at a time. We can specify the key based on which the join is to be performed.

It is a good practice to choose a key that is unique for each entry in the data frame, in order to avoid duplication of rows. We can also specify the type of join we wish to perform i.e. either of ‘inner’, ‘outer’, ‘left’, ‘right’ or ‘cross’ join.

We need to first read each CSV file into a separate data frame.

import pandas as pd

df1 = pd.read_csv('csv_file_1.csv')
df2 = pd.read_csv('csv_file_2.csv')
df3 = pd.read_csv('csv_file_3.csv')
Df1
df1
Df2
df2
Df3
df3

Joining df1 and df2:

df_merged = df1.merge(df2, how='outer')
df_merged

Output:

Merge Output 1
Merge Output 1

Joining df1 and df3 based on the key ‘name’.

df_merged = df1.merge(df3, on="name", how='outer')
df_merged

Output:

Merge Output 2
Merge Output 2

df1 and df3, both have an entry for the name ‘Sam’ and the age and score values for both of them are different. Hence, in the resulting data frame, there are columns for representing the entries from both df1 and df3. Since John and Bob are not common in the data frames df1 and df3, their values are NaN wherever applicable.


Conclusion

In this article, we learned about the Pandas methods namely concat, merge and append and how to use them to combine CSV files using Python.


References