Python Pandas Dynamically Create a Dataframe

Pandas is a fast, powerful tool for analyzing data, built on top of the Python programming language. If you study machine learning, you will use Pandas a great deal. It is used to explore, analyze, and manipulate large data sets, and it also helps in getting data ready for machine learning.

With so many similar tools available today, why do we use Pandas? The answer is its gentle learning curve: pandas is easy to use and integrates with many other Python data science and machine learning tools.

It has two major data types that are generally used, the ‘Series’ and the ‘DataFrame’. This article focuses on the DataFrame. To start working with pandas, we first install it using a package manager and then import it in the program where we want to use it.

pip install pandas

import pandas as pd

By convention, pandas is imported as ‘pd’, and this alias is standard practice almost everywhere.

To get a deeper understanding of pandas refer to this article here.

Importance of Dynamically Creating a DataFrame in Pandas

Dataframes, as we know, are important elements of the pandas library. There are various ways to create them; the one we are interested in is creating them dynamically.

Dynamically creating a dataframe is important when we don’t know the size of the dataframe at the time we create it, or when we want to rename the headers dynamically without a tedious manual process.

Creating a dataframe dynamically is useful for various other cases that we will understand as we go ahead.

Understanding Pandas DataFrame

DataFrame is the primary data structure of Pandas. It is a structure that contains named rows and columns, on which arithmetic operations can be aligned.

To create a dataframe, the below syntax can be used:

pd.DataFrame(data, index, columns, dtype, copy)

This is the general syntax for creating a dataframe in pandas. We will see how to create an actual dataframe in a later section.

A dataframe can be created with various types of inputs like lists, dictionaries, series, other dataframes, CSV files, and so on. It is a 2-Dimensional data structure.

It stores data in rows and columns, which can be easily manipulated. Dataframes are mutable, which means their contents and shape can be altered even after their creation.
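
To make the point about mutability concrete, here is a minimal sketch. The sample values and the "In Stock" column are made up for illustration. Note that assigning a column changes the dataframe in place, while `pd.concat` and `drop` return a new dataframe that we assign back to the same name.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"Company Name": ["Mahindra", "Tata"], "Price": [10000, 20000]})
print(df.shape)  # 2 rows, 2 columns

# Add a column in place: the dataframe now has 3 columns.
df["In Stock"] = [True, False]

# Add a row by concatenating a one-row dataframe (returns a new dataframe).
new_row = pd.DataFrame({"Company Name": ["Toyota"], "Price": [30000], "In Stock": [True]})
df = pd.concat([df, new_row], ignore_index=True)

# Drop the extra column again (also returns a new dataframe).
df = df.drop(columns=["In Stock"])
print(df.shape)  # 3 rows, 2 columns
```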

To get a better understanding of dataframes, do spend some time on this link.

Now that we have a basic idea of what dataframes are, let’s create one step by step.

Creating a Pandas DataFrame

Traditional Way

To begin with, you will need an environment. For this article, I will use Google Colab to demonstrate the various points. Now that we are in our Google Colab environment, we can start coding our very first dataframe.

cars = pd.DataFrame({"Company Name": ["Mahindra", "Tata", "Toyota"],
                     "Price": [10000, 20000, 30000]})

For the sake of explanation, I have used a dictionary of lists as input data to create a dataframe. The code is simple: I’ve used the syntax mentioned above and passed the data to it, which creates the dataframe.
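
A dictionary of lists is only one of the input types mentioned earlier. The sketch below shows a few others, with made-up values; the variable names `from_records`, `from_series`, and `from_frame` are only illustrative.

```python
import pandas as pd

# 1. A list of dictionaries: each dictionary becomes one row.
from_records = pd.DataFrame([
    {"Company Name": "Mahindra", "Price": 10000},
    {"Company Name": "Tata", "Price": 20000},
])

# 2. A dictionary of Series: each Series becomes one column.
from_series = pd.DataFrame({
    "Company Name": pd.Series(["Mahindra", "Tata"]),
    "Price": pd.Series([10000, 20000]),
})

# 3. Another dataframe: here we keep only the "Price" column.
from_frame = pd.DataFrame(from_records, columns=["Price"])

print(from_records)
print(from_series)
print(from_frame.columns.tolist())
```

All three calls go through the same `pd.DataFrame` constructor, so you can pick whichever input shape your data already has.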

If there are no errors in your code and you’ve followed along till here, then your dataframe should look something similar to this.

Now, consider a situation where you have a data set in which there are over 200,000 entries that you need to use to create a dataframe.

Manually typing in so many values would be tedious and repetitive, and our job as developers is to make the code as efficient as possible, so we must find a better way of creating a dataframe. Hence, we will start creating our dataframes dynamically.

Dynamically Creating a DataFrame

Dynamically creating a dataframe is an efficient way of creating a dataframe due to various reasons which are mentioned above in this article.

So, let’s get started to make our process more dynamic.

cols = ["Company Name", "Price"]
data = [("Mahindra", 10000), ("TATA", 20000), ("Toyota", 30000)]

Nothing much is happening here: I’ve just created two lists, one with the column names and one with the data for the dataframe.

cars = pd.DataFrame(data, index = range(len(data)), columns = cols)

Here, we have created a dataframe from the lists we made earlier. This is perhaps the simplest example of creating a dataframe dynamically.

If everything works well, you should have a dataframe similar to this by now.

[Figure: Simple dynamic DataFrame]

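Here, if I want to add some data, I can simply append it to the ‘data’ list and build the dataframe again. A dataframe that already exists does not pick up later changes to the list, so the constructor has to be called once more, as shown below.

data.append(("Renault", 15000))
cars = pd.DataFrame(data, columns = cols)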

The new dataframe looks like this.

[Figure: Simple dynamic DataFrame after the change]

So dynamically creating the dataframe has made manipulating the data very easy and efficient.
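
The same idea extends to the headers. Because the column names are ordinary Python data, they can be generated or renamed at run time. The sketch below assumes placeholder names of the form `col_0`, `col_1` and a hypothetical `header_map`; neither comes from a real data set.

```python
import pandas as pd

data = [("Mahindra", 10000), ("Tata", 20000), ("Toyota", 30000)]

# Generate placeholder headers from the width of the first row,
# so the code works for any number of fields.
cols = [f"col_{i}" for i in range(len(data[0]))]
cars = pd.DataFrame(data, columns=cols)

# Rename the headers later from a mapping (hypothetical names).
header_map = {"col_0": "Company Name", "col_1": "Price"}
cars = cars.rename(columns=header_map)

print(cars.columns.tolist())
```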

At this point, you might feel that the previous method was simple and easy to adopt, so why use this more involved method?

Considering the previous example, yes, it may be too much to create a dataframe dynamically for such a small data set, but with huge data sets this method becomes very handy. We shall see this in practice in the following sections.
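
As a rough sketch of how this scales, the rows below are produced by a loop instead of being typed by hand. The `generate_rows` function is a hypothetical stand-in for a real data source, and the values are synthetic. Collecting the rows in a plain list and calling the constructor once is generally faster than growing a dataframe one row at a time.

```python
import pandas as pd

def generate_rows(n):
    """Yield n synthetic (name, price) tuples; stands in for a real data source."""
    for i in range(n):
        yield (f"Car {i}", 10000 + i)

# Collect the rows in a plain list first, then build the dataframe once.
rows = list(generate_rows(200_000))
big_cars = pd.DataFrame(rows, columns=["Company Name", "Price"])

print(len(big_cars))
print(big_cars.head())
```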

Importing Datasets

Using CSV

With large datasets, it is very rare that we create the data ourselves. We usually take it from some other source, such as CSV (comma-separated values) files, MS Excel, Google Sheets, or a database.

Let’s see this with an example. I will import a CSV file into my notebook and convert it into a dataframe.

First, let’s import the dataset:

car_data = pd.read_csv('path-to-file/dataset')

If you’ve executed the above code without any error, the data set is now loaded into your notebook. The beauty of pandas is that read_csv returns the imported data as a dataframe directly.

So it’s quite effortless to create a dataframe when importing data from elsewhere. Another important point is that you can now use the various methods and functions pandas offers for dataframes to manipulate the data set we have imported.
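
Here is a small self-contained sketch of that workflow. Since the actual dataset file is not part of this article, the CSV content below is made up and read from a string through `io.StringIO`; with a real file you would pass its path to `read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO lets read_csv treat a string like a file.
csv_text = """car_name,price
Nano,80000
Scorpio,2000000
Thar,3000000
"""

car_data = pd.read_csv(io.StringIO(csv_text))

# Once loaded, the usual dataframe methods are available.
print(car_data.head(2))          # first two rows
print(car_data["price"].mean())  # average of the price column

# Keep only the rows whose price is above a threshold.
expensive = car_data[car_data["price"] > 1_000_000]
print(expensive["car_name"].tolist())
```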

We have seen that it is quite simple to convert a CSV file into a dataframe. Next, we will take an overview of how the same happens with SQL databases.

Using SQL Database

For explanation purposes, I will use sqlite3, because it ships with Python’s standard library. With other SQL databases the way you open the connection differs, but the query and the pandas call stay largely the same, so feel free to follow along.

To take a deep dive into using SQL with Python take a look at this link.

I have already created a database for this explanation using the below code.

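import sqlite3  # part of Python's standard library; other databases need their own driver

conn = sqlite3.connect('test_database')
c = conn.cursor()

# Create the table only if it does not exist yet.
c.execute('''
          CREATE TABLE IF NOT EXISTS cars
          ([car_id] INTEGER PRIMARY KEY, [car_name] TEXT, [price] INTEGER)
          ''')

# INSERT OR IGNORE skips rows whose car_id already exists,
# so running this cell a second time does not raise an error.
c.execute('''
          INSERT OR IGNORE INTO cars (car_id, car_name, price)
                VALUES
                (1,'Nano',80000),
                (2,'Scorpio',2000000),
                (3,'Thar',3000000),
                (4,'Swift',4500000),
                (5,'Seltos',1500000)
          ''')

# Save the changes to the database file.
conn.commit()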

So far, we have created our own database. Now it is time to import it into our environment and convert it into a dataframe.

import sqlite3
import pandas as pd

conn = sqlite3.connect('test_database')

# Run the query and load the result set straight into a dataframe.
new_cars = pd.read_sql_query('''
                          SELECT * FROM cars
                             ''', conn)
new_cars

The above code is a simple example of importing data from a SQL database into pandas. First, we import sqlite3, then connect to the desired database, and finally read the query result into a dataframe.
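
The query itself can also be made dynamic. The sketch below is an illustration under a few assumptions: it uses an in-memory database so that it runs on its own, and the `min_price` threshold is a made-up value. The `?` placeholder is filled in from the `params` argument of `read_sql_query`, which is safer than building the SQL string by hand.

```python
import sqlite3
import pandas as pd

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (car_id INTEGER PRIMARY KEY, car_name TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO cars (car_id, car_name, price) VALUES (?, ?, ?)",
    [(1, "Nano", 80000), (2, "Scorpio", 2000000), (3, "Thar", 3000000)],
)
conn.commit()

# The "?" placeholder is filled from params, so the threshold can change at run time.
min_price = 1_000_000
pricey = pd.read_sql_query(
    "SELECT car_name, price FROM cars WHERE price >= ? ORDER BY price",
    conn,
    params=(min_price,),
)
conn.close()

print(pricey)
```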

If you have followed along without making any errors, your screen should look like this by now.

Creating a DataFrame Dynamically Enhances Data Analysis

Dynamic creation of dataframes gives us flexibility by letting us manage datasets of any size and scale. It also automates the process of creating a dataframe, replacing the tedious manual approach, and this automation speeds up the work considerably.

It supports an iterative approach to data analysis. As developers, we need to make our code reproducible, and creating the dataframe dynamically satisfies this need as well.

Summary

To conclude, let’s skim through what we’ve covered in this article.

We started with a basic understanding of the Pandas library, a widely used Python library. We saw that the dataframe is one of its primary data structures and a go-to tool for data analysis, for the reasons discussed above.

The dataframe is very important for data analysis and is widely used. There are various ways to create one. Broadly, they can be classified as manual creation and dynamic creation, and dynamic creation is what we covered in depth in this article.

Dynamically creating a dataframe has various advantages over creating one manually. Finally, it is also important to understand the ability of Pandas to integrate data from various sources such as CSV files, SQL databases, and so on.

Reference

Official Documentation of Pandas.