How to Normalize semi-structured JSON data into a flat table?

Normalize Semi Structured JSON Data Into A Flat Table

JSON is the abbreviation for JavaScript Object Notation, mainly used for transporting data between web pages. JSON is language independent, which means it can be used with any programming language.

JSON is a file format that is used widely because it is a lightweight data exchange format easy to understand and parse by both humans and machines.

The main advantage of JSON is that it can be used as a medium of data exchange between clients and servers in web applications. It is also used for configuration files, logs, and data storage.

A JSON resembles a dictionary in python in data representation.

Look at the example for JSON given below.

{
    "name": "John Smith",
    "age": 35,
    "isMarried": true,
    "hobbies": ["reading", "hiking", "cooking"]
}

So in the above example, the attributes ‘name’, ‘age’, ‘isMarried’, and’ hobbies’ are taken as keys. And the data after ‘:’ is treated as a value. We can say that JSON stores the data in the form of key: value pairs.


Nested JSON

As self-describing as it is, a nested JSON is a big JSON object which is a collection of other JSON objects.

Here is an example of a nested JSON object.

{
    "firstName": "John",
    "lastName": "Doe",
    "age": 30,
    "address": {
        "street": "123 Main St",
        "city": "Anytown",
        "state": "CA",
        "zip": "12345"
    },
    "phoneNumbers": [
        {
            "type": "home",
            "number": "555-555-1212"
        },
        {
            "type": "work",
            "number": "555-555-2121"
        }
    ]
}

In this example, the top-level object has five properties: “firstName”, “lastName”, “age”, “address”, and “phoneNumbers”. The “address” property is itself an object that has four nested properties: “street”, “city”, “state”, and “zip”. The “phoneNumbers” property is an array of two objects, each of which has two properties: “type” and “number.”

The nesting creates a hierarchical structure useful for organizing and representing complex data.


Why Normalize JSON?

While using JSON and nested JSON is useful in the hierarchical storage of the data, it might become difficult to work with complex data. The main disadvantage of JSON is that it has limited data types, and we might have to deal with a few data types that are not supported by this format. So, when we normalize the JSON into a flat table structure, it is even easier to deal with the complex data and also can be converted into other structures like a data frame after normalization.

Normalization of JSON might also help with the security of the data.


Introduction to Pandas.json_normalize

The JSON object can be normalized to reduce the redundancy and complexity of manipulation. Normalizing to a flat table allows the data to be queried and indexed.

The Pandas Library provides a method to normalize the JSON data.

If you are unfamiliar with the Pandas Library and its basic data structures, read this article on Introduction to Pandas.

The syntax is given below.

pandas.json_normalize(data, record_path=None, meta=None, meta_prefix=None, record_prefix=None, errors='raise', sep='.', max_level=None)

The important parameters of the syntax are:

Number Argument DescriptionDefault value/typeRequirability
1dataUnserialized JSON object, which has to be converted to a table
This argument can be a nested JSON, a list of JSON objects, or a JSON string
dict or list of dictsRequired
2record_pathThis argument is used to specify the path to the records of the JSON, which needs to be flattened
When this argument is present, the records that are given with this argument are flattened
If not passed, all the records are flattened
str or list of str,
default- None
Required
3metaFields to use as metadata for each record in the resulting table
These records will not be flattened and are used to describe the other fields that are flattened
str or list of str,
default- None
Required
4meta_prefixThis field is used as a string to name the fields in meta
str,
default- None
Required
5record_prefixThis field is used as a string to prepend the columns from record_pathstr,
default- None
Required
6errors‘ignore’: Will ignore KeyError if keys listed in meta are not always present
‘raise’: Will raise KeyError if keys listed in meta are not always present
‘raise’, ‘ignore’,
default- None
Required
7sepThe separator to use when flattening nested datastr,
default-‘.’
Required
8max_levelThe maximum number of levels to normalize
It can be used to limit the amount of flattening that occurs
If None, normalize all levels.
int,
default- None
Required
Arguments of json_normalize

Return Type: Returns a DataFrame.


Examples of Using Pandas.json_normalize

In this post, we will look at creating a nested JSON object and then normalizing it. We will also see loading a JSON file as a file path and normalizing it into a flat table.

We are going to see the usage of record_path to normalize specific columns.


Example 1: Normalizing a Nested JSON

Let us create a nested JSON with the help of the key: value pair format and normalize it.

The code for creating JSON is given below.

import pandas as pd
import json
#creating json
data = [
    {"Roll no": 44,
     "first_name": "Kalyan",
     "last_name": "Sai",
     "Sanskrit": "85",
     "English": "100",
     "French": "67",
     "Biology": 35,
     "Physics": 70,
     "Math": 82
     },
    {"Roll no": 45,
     "first_name": "Sai",
     "last_name": None,
     "Sanskrit": "95",
     "English": "88",
     "French": None,
     "Biology": None,
     "Physics": None,
     "Math": None
     },
    {"Roll no": 46,
     "first_name": "John",
     "last_name": None,
     "Sanskrit": None,
     "English": None,
     "French": "90",
     "Biology": "82",
     "Physics": None,
     "Math": None
     },
    {"Roll no": 47,
     "first_name": "Alyssa",
     "last_name": "Choi",
     "Sanskrit": None,
     "English": None,
     "French": None,
     "Biology": None,
     "Physics": "73",
     "Math": "98"
     }
]

print(data)

Let us break down the code.

In the first line, we are importing the Pandas library as pd, which is essential for normalizing JSON.

Next, we are importing json package to be able to work with JSON objects.

So we are creating a nested dictionary stored in a variable called data.

We are printing this using print().

We specify that there are no records for some keys by using ‘None.’ This is helpful to make the data less complex, and not doing so might raise some errors.

Dictionary
Dictionary

The data is not yet in JSON format. We need to convert this dictionary into JSON.

The code is given below.

json_data = json.dumps(data)
dict_data = json.loads(json_data)
print(dict_data)

json.dumps is a method of the json package used to convert any python object into a json formatted string. This is stored in a variable called json_data.

If you have seen the syntax json.normalize, you might know that this method only accepts a dictionary or a list of dictionaries. Hence, we use json.loads to create a dictionary object of the json, which is stored in dict_data.

In the following line, we are printing this data

JSON
JSON

Although both the outputs look essentially the same, they are not. The first output is the python object, while the second is a json object.

Now, we need to normalize the json into a table.

#normalizing
df = pd.json_normalize(json_data)
print(df)

We are using the json_normalize method to flatten the json into a data frame.

The flattened table is obtained as shown below.

JSON-DF1
JSON-DF1

Although a table or data frame is obtained, this needs to be modified to increase readability.

We can replace the NaN values with zeroes and perform some cleaning.

Read this article on How to replace NaN values in a Pandas data frame with 0.


Example 2: Reading JSON from File Path to Normalize

Let us take a dataset already in JSON format, read it from its file path, and then normalize it.

The JSON file we will work on is a dataset of numerical reasoning over financial data.

The JSON file, when downloaded, will look something like this.

JSON File
JSON File

The code is given below.

import pandas as pd
import json
#loading a json file
df=pd.read_json('/content/test.json')
 with open('/content/test.json') as f:
  d=json.load(f)
df=pd.json_normalize(d)
print(df.head())

Firstly, we are importing the Pandas library with its standard alias name pd.

Next, we are importing json package.

In the following line, we are reading the json file using pd.read_json and is storing it in an object ‘df.’

Next, we open this file with a short access keyword ‘d’.

json.load(): Since json_normalize only accepts dictionaries, the load method is used to convert this json into a dictionary-like object.

One difference you might observe from the previous example to this one is the usage of json.dumps. Since the data in this example is already in JSON, we don’t need to use the dumps method. But in the previous example, the data we wished to flatten was a python object. So we used dumps to convert it into JSON object.

Next, we are normalizing this json file into a data frame.

In the following line, we are printing the first five entries of the data frame.

The normalized table is shown below.

JSON-DF2
JSON-DF2

You have seen one of the many methods to read a JSON file.

Check out this article on The methods to read a JSON file as a path.


Example 3: Using the record_path parameter of Pandas.json_normalize()

Let us take the same dataset from the above example and use record_path.

We follow the same code as the above example but use the record_path in the syntax.
This parameter is used to normalize a specific column from the file.

import pandas as pd
import json
#loading a json file
df=pd.read_json('/content/test.json')
with open('/content/test.json') as f:
  d=json.load(f)
#normalizing the 'pre_text' column
df1=pd.json_normalize(d,record_path='pre_text')
print(df1.head())

The first two lines import the necessary library and packages to work with json.

In the fourth line, we are reading the json file as a path using read_json.

Next, we convert this json string into a dictionary using load to be able to normalize it.

But, in the eighth line, we are normalizing only one column of the dataset, which is ‘pre_text’.

So in the output, we can only see that the column ‘pre_text’ is normalized.

record_path1
record_path1

Now suppose you want to normalize two columns at the same time. This can be done using pd.concat.

First, we need to normalize two columns separately and then concatenate them.

#normalizing column2
df2=pd.json_normalize(d,record_path='table_ori')
print(df2.head())

In this code snippet, we are normalizing another column from the dataset, which is ‘table_ori’.This is stored in a variable called df2.

We are printing the first five entries of the column with the df2.head() method.

record_path2
record_path2

Now, let us concatenate them.

dff=pd.concat([df1,df2],axis=1)
print(dff)

We are using the pd.concat method to combine both the normalized columns and store them in a new data frame called dff.

record_path3
record_path3

Conclusion

In this post, we have seen what a JSON file, its widespread applications in web APIs, and its resemblance to dictionaries in python is.

Next, we have seen the nested JSON, an example, and how nested json is used to store the data hierarchically.

Next, we have learned why we should normalize this format, one of the most important reasons being the lack of data structures in JSON.

We have seen the syntax of pd.json_normalize and its parameters.

Coming to the examples, we have seen how to create a JSON from python dictionaries and then normalize it with the help of json.loads and json.dumps.

Next, we took a JSON file as a path, read it, and normalized it.

In the next example, we have seen the usage of record_path to normalize the specified columns. We have also seen the usage of pd.concat to get two normalized columns in one data frame.


References

Find the JSON file here

Refer to the Official Pandas Documentation