Pivot Table in Python From One Dataset to Another

Pivot Table One Dataset

Sometimes we want to generate quick insights on only a portion of data from large datasets. Pivot table in Python comes in handy in such situations and helps us to extract the data we need and generate insights from the new data. Today we will be using the pivot table function present in Pandas. This article will give you a comprehensive overview of how you can pivot table one dataset to another in python.

Also read: How to insert multiple records in Sqlite3 database

What is a pivot table and how do you make a pivot table?

A pivot table is a table that helps in extracting data from a larger table or a dataset. In other words, we “pivot” data from a larger dataset. Let’s have a look at the syntax of a pivot table:

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None)

The pivot table function will return a dataframe. Now that we know the basics of a pivot table, lets’ start pivoting!

Load the Dataset

Let’s start by loading the dataset in our environment.

import pandas as pd
df= pd.read_csv("C://Users//Intel//Documents//covid-variants.csv")
df.head()
Screenshot 374

Now we will start building a pivot table.

Generating a Pivot Table in Python

num_sequences_by_variant = pd.pivot_table(df, index = 'variant', values = 'num_sequences',aggfunc='mean') 
print(num_sequences_by_variant)

                      num_sequences
variant                      
Alpha              270.696702
B.1.1.277            0.282744
B.1.1.302            0.116157
B.1.1.519            5.455306
B.1.160              8.130736
B.1.177             40.740201
B.1.221              3.675191
B.1.258              7.358270
B.1.367              0.229685
B.1.620              0.242830
Beta                 9.683078
Delta              916.371893
Epsilon             15.804732
Eta                  1.654876
Gamma               27.522945
Iota                10.254541
Kappa                1.787046
Lambda               2.249283
Mu                   3.405354
Omicron             27.614245
S:677H.Robin1        1.564771
S:677P.Pelican       1.156071
non_who            222.537763
others             153.585803

As you can see, we created a pivot table with the variables num_sequences and variants. As we have passed the aggfunc function as mean, the num_sequences is actually the average of all the num_sequences.

If you want the sum of all num_sequences, you can pass the sum to the aggfunc parameter.

total_by_variant = pd.pivot_table(df, index = 'variant', values = 'num_sequences',aggfunc='sum') 
print(total_by_variant)

Output

                 num_sequences
variant                      
Alpha                 1132595
B.1.1.277                1183
B.1.1.302                 486
B.1.1.519               22825
B.1.160                 34019
B.1.177                170457
B.1.221                 15377
B.1.258                 30787
B.1.367                   961
B.1.620                  1016
Beta                    40514
Delta                 3834100
Epsilon                 66127
Eta                      6924
Gamma                  115156
Iota                    42905
Kappa                    7477
Lambda                   9411
Mu                      14248
Omicron                115538
S:677H.Robin1            6547
S:677P.Pelican           4837
non_who                931098
others                 642603

Setting the Index for the Pivot Table in Python

We can very simply reset the index of the above dataframe with the help of the following command.

num_sequences_by_variant.reset_index()
Screenshot 376

Selecting/Filtering Data from a Pivot Table in Python

Now suppose, we want to look at the variants with the total number of sequences greater than 30000 we have to pass the following command.

total_by_variant[total_by_variant['num_sequences'] > 30000]
Screenshot 378

Now again, let’s suppose we want to see the variants having a number of sequences only between 30000 and 70000. Then we have to pass the following command.

total_by_variant[(total_by_variant['num_sequences'] > 30000) | (total_by_variant['num_sequences'] < 70000)]
Screenshot 380

Now let’s jump into learning about multi-index pivot tables.

Multi-Index Pivot Table in Python

Till now, we have only learned about pivot tabes with a single index. But we can also form pivot tables with multiple indices and that can help you generate more insights from your data.

In the following table, we have set variant type and date as our indices.

multi_index_table= pd.pivot_table(df, index = ['variant', 'date'], values = 'num_sequences', aggfunc = 'sum') 
print(multi_index_table)

Output

                               num_sequences
variant date                     
Alpha   2020-05-11              0
        2020-05-25              0
        2020-06-08              0
        2020-06-22              0
        2020-07-06              0
...                           ...
others  2021-11-15            928
        2021-11-29            928
        2021-12-13            619
        2021-12-27            603
        2022-01-05              2

[1080 rows x 1 columns]

Multi-Index Pivot Table with Multiple Functions

Apart from setting multiple indices, we can also set multiple functions to the aggfunc parameter for our pivot table.

For example:

multi_functions_table = pd.pivot_table(df, index = ['variant', 'date'], values = 'num_sequences', aggfunc = ['sum',len])
print(multi_functions_table)

Output

              sum                                               len
                   num_sequences num_sequences
variant date                                  
Alpha   2020-05-11             0            78
        2020-05-25             0            72
        2020-06-08             0            78
        2020-06-22             0            79
        2020-07-06             0            74
...                          ...           ...
others  2021-11-15           928            89
        2021-11-29           928            85
        2021-12-13           619            86
        2021-12-27           603            64
        2022-01-05             2            12

[1080 rows x 2 columns]

Adding New Columns To The Pivot Table in Python

In this step, we will add the variant feature as a column and set the date as the index. This will make your data more organized and readable.

add_columns=pd.pivot_table(df, index = 'date', columns = 'variant', values = 'num_sequences', aggfunc = 'sum') 
print(add_columns)

Output

variant      Alpha  B.1.1.277  B.1.1.302  B.1.1.519  B.1.160  B.1.177  \
date                                                                    
2020-05-11       0          0          0          0        0        2   
2020-05-25       0          0          0          0        0        0   
2020-06-08       0          0          0          0        0        2   
2020-06-22       0          0          0          0        0        8   
2020-07-06       0          0          0          0        0       72   
2020-07-20       0          7          3          0        8       88   
2020-08-03       0         12          8          0       22      172   
2020-08-17       0         43         12          0       89      499   
2020-08-31       0         94         24          0      386     1400   
2020-09-14       0        143         32          0     1017     2923   
2020-09-28       3        178         27          0     1049     6225   
2020-10-12      15        120         31          0     1188     8693   
2020-10-26      47        155         21          1     2017    13008   
2020-11-09     381        180         21          5     2773    15723   
2020-11-23    1450         72         27         17     2274    15622   
2020-12-07    1128         42         29         34     2029    10489   
2020-12-21    9061         28         42         82     3110    19350   
2021-01-04   14002         29         21        184     2934    14161   

Visuzaling Your Pivot Table in Python

In this section, we will create bar plots for our table. This will help you visualize the data more clearly.

add_columns.plot(kind='bar')
Image 7

Although the plot looks very clumsy, as we always say, you can use the matplotlib library or Seaborn library in Python to make the plots more presentable.

Working with missing values

In this section, we will learn how to handle missing values in your pivot table data. First, let’s check if there are any missing values present in the dataset.

print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100416 entries, 0 to 100415
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   location             100416 non-null  object 
 1   date                 100416 non-null  object 
 2   variant              100416 non-null  object 
 3   num_sequences        100416 non-null  int64  
 4   perc_sequences       100416 non-null  float64
 5   num_sequences_total  100416 non-null  int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 4.6+ MB
None

As we can see, we do not have any missing values in our dataset, so we do not need to do anything with our data.

In case you have any missing values in your data, you can pass the fill_value argument as N/A in the pivot table function.

Conclusion

In this article, we learned how to generate pivot tables from a given dataset and transform it into another dataframe. Pivot tables will really come in handy for you when you are analyzing any dataset with a large number of features or variables. Learning all the above pivot table functions by heart will help you a lot in your data analysis projects.