How to Concatenate Multiple CSV Files Into One Data Frame?

Concatenate Multiple CSV Files Into A Single Data Frame

Suppose you are working on your dream project and have loads of data related to each other but spread across multiple CSV files. You would want to analyze and manipulate all this data simultaneously and wish it to be hassle-free.

One approach to this issue would be importing all the CSV files you need or want to use and concatenating these files into a single data frame.

Before we move on to concatenating the CSV files, let us learn about the Pandas Data Frame, CSV file format, and the method used to concatenate the files.

What is a Data Frame?

When discussing the Pandas library, it is a must to talk about one of its most used data structures- Data Frame.

Refer to this article to learn more about the data structures of the Pandas Library.

A data frame is a two-dimensional table that stores data in rows and columns.

It is similar to a matrix in linear algebra but can do much more than a matrix. It can store different data types such as numeric, categorical, character, float, and many more to name which makes it easy to handle data with multiple data types.

Let us see an example of a data frame.

#example of a data frame
import pandas as pd
electronics_data = {
    'Date': ['2022-01-01', '2022-01-02', '2022-01-03', '2022-01-04'],
    'Salesperson': ['Alice', 'Bob', 'Charlie', 'Alice'],
    'Product': ['Buds', 'Speaker', 'Smartphone', 'Mac'],
    'Price': [2.99, 7.50,9.99,12.25]}
df = pd.DataFrame(electronics_data)
print(df)

The Pandas library is used to work with data and storage of the data. The first line, we can see that it is being imported as pd.

Next, we create a dictionary of values stored in a key-value pair format. This dictionary shows us the sales data of an electronics shop. It shows us the customer’s name, date of purchase, and the item that has been purchased.

Now, this is a dictionary. We need a Data frame. Luckily, we have the Pandas library, which has ample inbuilt functions that support converting any data storage format into a data frame and vice-versa. One such method is pd.DataFrame.

We can pass any data format to this method as an argument, and boom! we have a data frame.

The data frame is shown below.

Data Frame
Data Frame

What is a CSV File?

CSV stands for Comma Separated Values. As self-describing as it is, the CSV stores data as rows, separated by a comma.

The CSV file format stores the data in a simple format. Such storage can support data exchange between different platforms and even languages.

Let us see an example of a CSV file and how we can import a CSV file using the Pandas library.

Look at the IPL team dataset. When downloaded, it looks something like this.

Csv Example
Csv Example

It has all the necessary information about players considered to be selected for the team. If you notice, all the entries are separated by a comma.

How to import this dataset, you wonder? We’ve got you!

The code to import the CSV dataset is given below.

#CSV example
import pandas as pd
df=pd.read_csv('/content/drive/MyDrive/Player.csv')
df

Firstly, we import the essential libraries to our environment to work with.

Next, we create a new variable to store the CSV file using the pd.read_csv. The file we wish to read should be downloaded from the source, and in the same environment we work in. The path of the file is passed as an argument to the read_csv function.

Pandas read_csv(): Read a CSV File into a DataFrame

The output is a data frame.

Importing The CSV File
Importing The CSV File

Since Google Colaboratory is used here, the file is uploaded to Google Drive, which is mounted in Colab for easy access. The file we use is Player.csv, whose path is copied and passed as an argument to the read function.

As you can see from the data frame, the last column Unnamed:7 is completely filled with NaN values. When you see such columns, you have two approaches. Either fill the NaN values with zero or any strategy like mean, median, or just drop the column.

If you observe the attributes of the data frame, it might occur to you that this column is completely useless as it is unrelated to other labels.

In such cases, you can drop the column.

Here is the code and output after cleaning.

df.drop('Unnamed: 7',axis=1)

The drop function drops unnecessary rows and columns from a data frame. If a column is dropped, the axis is equal to ‘1’. If a row is dropped, the axis is equal to ‘0’.

The cleaned data frame is shown below.

df.drop
df.drop

Now that we have covered the basics of the data frames and CSV, let us see an important function used to concatenate the data frames- pd.concat.


Exploring the syntax of pd. concat

This method is used to concatenate Pandas objects along the specified axis.

The syntax is given below.

pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

The important arguments of the syntax are

objs : The sequence of Pandas objects to be concatenated
axis: This argument specifies the axis on which the concatenation should take place.If the index of the objects is used to concatenate, the axis should be 0. If the objects’ columns are to be used to concatenate, this argument should be equal to 1.
ignore_index: If this argument is set to True, the resulting data frame will have an unnamed index column
starting from 0.
The resulting data frame will not have an extra index column if it is set to False.

The other parameters are:
join: It tells what indices to include.
If the join is ‘outer’, the union of the indices is used.
If the join is ‘inner’, the intersection of the indices is used.

keys: Used to construct a hierarchical index.
This argument might be helpful when we try to concatenate objects with columns or labels representing different entries.
sort: This argument specifies if the data frame should be sorted. By default, it is False.
copy: By default, this argument is set to True, which means a new object is created for the data.

Return Type: object, type of objs. If axis=0, a series is created. If the axis=1, a data frame is created.

Let us see the usage of pd.concat in the following examples.


Importing Multiple CSV Files, Creating Separate Data Frames for Each, and Concatenating Them.

In this example, let us take two related CSV files, import them using read_csv separately, and concatenate their respective data frames.

The datasets we will deal with in this example are related to the COVID-19 data of the United Kingdom(UK).

We have three datasets related to one another. Let us go through what each of them has in store for us.

Let us see how to import and create a data frame for each dataset.

England_Regions_COVID_Dataset.csv

England_Regions_COVID_Dataset.csv
It has the following labels associated with it.
date– the date on which the survey is released.
areaName-The area in which the cases are recorded.
newDeaths28DaysByDeathDate-New death cases within 28days of a positive COVID-19 test.
cumDeaths28DaysByDeathDate-Cumulative number of deaths within 28 days of a positive COVID-19 test.

#reading multiple csv files 
#csv1
import pandas as pd
df1=pd.read_csv('/content/drive/MyDrive/England_Regions_COVID_Dataset.csv')
df1

The first dataset is read into a variable called df1.

The data frame corresponding to this data set is given below.

Data Frame 1.1
Data Frame1.1

UK_National_Total_COVID_Dataset.csv

The attributes in this dataset are explained below.

date– The date on which the survey is released.
areaName-The area in which the cases are recorded.
newCasesByPublishDate-The number of new cases registered on or before the publish date.
cumCasesByPublishDate-The cumulative cases registered on or before the publish date.
newAdmissions-The number of people admitted to the hospital.
hospitalCases-Already existing n the hospital.
covidOccupiedMVBeds-Number of beds occupied in the hospitals.
cumPeopleVaccinatedFirstDoseByPublishDate -Cumulative number of people who got their first shot of vaccine on or before the publish date.
cumPeopleVaccinatedSecondDoseByPublishDate-Cumulative number of people who got their second shot of vaccine on or before the publish date.

#csv2
df2=pd.read_csv('/content/drive/MyDrive/UK_National_Total_COVID_Dataset.csv')
df2

The second data set is read into a new variable called df2 and then printed.

Data Frame1.1
Data Frame1.2

NEW_Official_Population_Data_ONS_mid-2019.csv

This dataset has the following attributes.
areaCode-Postal code of the area.
areaName– Name of the area.
areaType– Type of the area (country, LocalGovernment District).
Area-The area of the region in square kilometers.
Population– The population of the area.

#csv3
df3=pd.read_csv('/content/drive/MyDrive/NEW_Official_Population_Data_ONS_mid-2019.csv')
df3

The data frame of the above CSV file is as shown.

Data Frame 1.3
Data Frame 1.3

Concatenating the Three Data Frames

To concatenate the data frames, we use the pd.concat method.

The code is given below.

#concatenated data frame
df4=pd.concat([df1,df2,df3], axis=0, ignore_index=True)
df4

The three data frames are passed a list to the pd.concat method.

The concatenated data frame is shown below.

Concatenation of Multiple CSV Files 1
Concatenation of Multiple CSV Files 1

There are so many NaN values at the beginning of some columns because the first data frame may not have those attributes.


Using the Map Function to Concatenate Multiple Data Frames All at Once

Following the previous example would consume our time. What if we have a function that can read all the CSV files at once and return the concatenated data frame?

A function called map is used to apply a specific function for a sequence of objects.

The syntax of the map function is as follows.

map(function,iterable....)

Let us see the usage of this function.

But before that let us know about the data sets used in this example.

The two data sets we are going to see are spam detection datasets.

emails.csv

This dataset has around 3000 columns.The first column of this dataset is the email number, while the last column is the test label(spam-1 and not spam-0). The remaining columns are the lables for most frequently occurring words in emails.

Some of these words include the, to, and, for, of, a, you, in, on, is, this, I, be, that, will

spambase_csv.csv

This dataset has the frequencies of some commonly occuring words in an email and a label to In this predict if the email is spam or not spam.(1 for spam and 0 for not spam).

Let us see the code concatenating these two files into one data frame.

import pandas as pd
df = pd.concat(map(pd.read_csv, ['/content/drive/MyDrive/emails.csv','/content/drive/MyDrive/spambase_csv.csv']))
df

In this cide snippet, we have first imported the Pandas library as pd.

Next we have used the map function to apply the pd.read_csv function for the two data sets.

If compared with the syntax of the map function we have,

Function-read_csv
Iterables-emails.csv,spambase_csv.csv.

The concatenated data frame is stored in the variable called df.

Concatenation of Multiple CSV Files 2
Concatenation of Multiple CSV Files 2

Using a for Loop to Concatenate All the Files in a Directory

In this example, we are going to create a folder for the CSV files we are going to need and then run a for loop to read each file and use an empty list to concatenate them.

Let us first know what data sets we are dealing with.

Ball_by_Ball.csv

This data set is related to a certain match where each ball thrown is recorded.It has the following attributes.

  • Match_Id
  • Innings_Id
  • Over_Id
  • Ball_Id
  • Team_Batting_Id
  • Team_Bowling_Id
  • Striker_Id
  • Striker_Batting_Position
  • Non_Striker_Id
  • Bowler_Id

Match.csv

This data set contains details of all the matches played by certain teams.

It has the following lables.

  • Match_Id
  • Match_Date
  • Team_Name_Id
  • Opponent_Team_Id
  • Season_Id
  • Venue_Name
  • Toss_Winner_Id
  • Toss_Decision
  • IS_Superover
  • IS_Result Is_DuckWorthLewis
  • Win_Type
  • Won_By Match_Winner_Id
  • Man_Of_The_Match_Id
  • First_Umpire_Id
  • Second_Umpire_Id
  • City_Name
  • Host_Country

Player_Match.csv

This data set is the smallest of all. It contains the details of the players of the match.

The labels in this dataset are:

  • Match_Id
  • Player_Id
  • Team_Id
  • Is_Keeper
  • Is_Captain

Now let us see the code

import pandas as pd
import os
path='/content/drive/MyDrive/IPL'
files = [f for f in os.listdir(path) if f.endswith('.csv')]
li = []
for file in files:
    df = pd.read_csv(os.path.join(path, file))
    li.append(df)
df = pd.concat(li, axis=0, ignore_index=True)
df.columns

In the first two lines we are importing the pandas library and the os module.

The os module is used to interact with the Operating System.

Next, we create a variable called path to store the path of the directory in which the CSV files reside.

Next, we create a variable called files that is used as an iterator in for loop to read all the CSV files.

The listdir of the os module is used to return all the files in a directory.

We are only going to read the files that end with .csv extension.

An empty list is created to store the result of reading the files.

All the files are read using pd.read_csv and are stored in the variable called df.

The data frames are apended to the empty list li.

Next, the concat method is used to get all the data frames into a single data frame.

df.columns is used to print the name of all the columns in the data frame.

Df Columns
Df Columns

The concatenated data frame is given below.

Concatenation of Multiple CSV Files 3
Concatenation of Multiple CSV Files 3

Conclusion

To summarize everything we have seen in this post, we learnt about the Pandas Data Frame with the help of an example, the CSV file format and also importing the CSV files into our environment using pd.read_csv.

Next, we have seen the need to concatenate multiple CSV files into a single data frame.

We have also observed how a CSV file looks like when downloaded. We have also learnt how to drop irrelevant data frame columns using df,drop function.

We have learnt the syntax of the main character of this post- pd.concat and understood its parameters in detail.

Coming to the examples, firstly, we have seen the naive and time taking approach for this problem.

We have imported different CSV files, read them and obtained a data frame for each of them. All these data frames are then passed as argument to the pd.concat method.

Coming to the second example, we have seen a better approach.

We used the map function to apply the pd.read_csv method to all the files at a time.

We have also seen the syntax of the map function.

Lastly, we created a directory to store all the files at one place and used the functions of os module and a for loop to read the CSV files.


Datasets

You can find the IPL dataset used in the example for CSV and also the last example here.

Find the UK COVID-19 survey data set used in the fiirst example here.

Find the emails.csv file here.

You can download the spam email classification file from here.


References

Visit the Pandas official documentation to know more about the concat method.

Also check out the documentation for DataFrame.drop.

Check out this illustration of the map function.