Converting Data in CSV to XML in Python

Comma Separated Values (CSV) is a widely used format to store data. It successfully stores data in the form of rows and columns in an easily parseable format. As the name suggests, the rows in a CSV file are separated by a delimiter, usually the newline (‘\n’) character. Each value in a row is separated from the other using a comma (,). Opening a CSV file using Microsoft Excel portrays the data in a tabular format, but the CSV format can be understood by opening the CSV file using a text editor as well. There are libraries available in Python that can be used to parse CSV files, but the most common library is the Pandas library.

If you are unfamiliar with the pandas library, check out this tutorial on pandas.

Representation Of A CSV File In A Text Editor

In the above image, one can see that the first row is nothing but the headers or the titles of the columns. These titles are used in Python as a subscript to Pandas dataframe to call an entire column.

Extensible Markup Language or XML is a Markup Language used to represent data in a structured format. It uses tags in ‘<>’ characters, with opening and closing tags representing data. The tag consists of the key/type of the data included in the tag content. The content of the tag includes the values of the key/type in the tag. It would be more clarified as the article progresses. XML also provides styling options, to beautifully display the data. An XML file can be opened with supported browsers and can be edited with text editors or dedicated XML editors.

The main motto of this article is to write a program that reads data from the CSV file and converts it into structured XML format. The tags of the XML will be named after the column names in the CSV file. The tag content will consist of the row values in the CSV. The above CSV will be used to be converted to XML format. The given input (in tabular form) and the expected output (in XML form) are shown below:

<collection oldcars="Car Details">
<carid title="C001">
<company>Hyundai</company>
<model>Aura</model>
<releaseyear>2019</releaseyear>
<licenseno>GJ23BB8384</licenseno>
<color>White</color>
<kmstravelled>28000</kmstravelled>
</carid>
<carid title="C002">
<company>Maruti</company>
<model>Eeco</model>
<releaseyear>2001</releaseyear>
<licenseno>GJ15AM7634</licenseno>
<color>White</color>
<kmstravelled>102013</kmstravelled>
</carid>
<carid title="C003">
<company>Hyundai</company>
<model>Alcazar</model>
<releaseyear>2021</releaseyear>
<licenseno>GJ01KJ9845</licenseno>
<color>Red</color>
<kmstravelled>3401</kmstravelled>
</carid>
<carid title="C004">
<company>Honda</company>
<model>City</model>
<releaseyear>2007</releaseyear>
<licenseno>GJ06RE4198</licenseno>
<color>Black</color>
<kmstravelled>45000</kmstravelled>
</carid>
<carid title="C005">
<company>Mahindra</company>
<model>Bolero</model>
<releaseyear>2006</releaseyear>
<licenseno>GJ17OY8714</licenseno>
<color>Black</color>
<kmstravelled>129090</kmstravelled>
</carid>
</collection>

Expected XML Output after the Program has run

A Python function will be created which takes the input of the name and path of the CSV file and the name and path of the output XML file. The rest of the work will be done by the code. The driver program will call the function, and then the XML file will be opened in Browser to check the output. It should be noted that no style characteristics are being added to the XML structure here, and simply an XML Document Tree is being created.

Creating the Function

The function takes two inputs, the input file path and the output file path. But there needs to be an Error checker to make sure the inputs given to the function are accurate and not wrong. Therefore, the first thing the function does is to check whether the given input file name is a CSV file and the given output file name is an XML file. The following code demonstrates the same:

import pandas as pd
def CSVtoXML(inputfile,outputfile):
    if not inputfile.lower().endswith('.csv'):
        print('Expected A CSV File')
        return 0
    if not outputfile.lower().endswith('.xml'):
        print('Expected a XML file')
        return 0

The first thing to be done in the above code is to import the Pandas module. It would be highly necessary to read the CSV file as a dataframe. The check on the input file as CSV and the output file as XML is done using Python String’s endswith() function. The file always ends with its extension, in the case of CSV is ‘.csv’, and in the case of XML is ‘.xml’. The endswith() function checks the last characters of the string with the given input. If both of them match and are equal, it returns True or returns False.

Now, a try-except block needs to be established to ensure that the given CSV file as an input exists. If the File is not found, the program shall return a FileNotFoundError. On intercepting this error, the program shall return an Error message giving the input as an existing file. The following code does the same:

    try:
        df=pd.read_csv(inputfile)
    except FileNotFoundError:
        print('CSV file not found')
        return 0

The above code snippet tries to read the given CSV file as a Pandas Dataframe object. If the file doesn’t exist, pandas will produce a FileNotFoundError.

Creating XML from Dataframe

Now that the CSV file has been successfully read as a Dataframe, the code manipulates the dataframe into a string format, which matches the actual XML required format. The following code demonstrates the same:

    entireop='<collection oldcars="Car Details">\n'
    att=df.columns
    rowop=''
    for j in range(len(df)):
        for i in range(len(att)):
            if i==0:
                rowop=rowop+f'<{att[i]} title="{df[att[i]][j]}">\n'
            elif i==len(att)-1:
                rowop=rowop+f'<{att[i]}>{df[att[i]][j]}</{att[i]}>\n</{att[0]}>\n'
            else:
                rowop=rowop+f'<{att[i]}>{df[att[i]][j]}</{att[i]}>\n'
    entireop=entireop+rowop+'</collection>'
    with open(outputfile,'w') as f:
        f.write(entireop)
CSVtoXML('Car Details.csv','example.xml')

In the above code snippet, entireop variable contains the string of the entire XML output, which is required. The XML must always be declared with a tag describing the data portrayed by the XML Document Tree. Here, the tag is named <collection>. Also, a description of the data is given as oldcars, implying that the car details portrayed by the XML Document are old cars.

The columns feature of Dataframe gives a list containing all the column names of the Dataframe as an output. It is used to store each and every column name, and they can be later used as a subscript to the Dataframe to call an entire column. The att variable stores the column list. The remaining task is to run through all the rows in the Dataframe, get a separate input for each value in the columns of the row, create a tag using the column name, and put the tag content as the value taken from the row.

The for loop does the task described earlier. To accurately locate a particular value in the Dataframe, the exact column number and row number are required. For e.g., the car model ‘Alcazar’ is situated in the third column of the third row. So it is called a Dataframe[2][2]. The first subscript is column number, and the second subscript is row number. It should be noted that the Dataframe follows 0-indexing; that is, the first element in Dataframe is Dataframe[0][0]. An easier way to call the value at Dataframe[2][2] is by subscripting the column with its column name. It is same as Dataframe[‘model’][2]. As the ‘model’ value lies at the third position in the att list, it can be easily referenced by the Dataframe, using the att list.

That is done in the for loop provided above. The XML Document Tree also should be able to independently identify each and every data object uniquely. That is the reason that the first tag of every data object, in this case, carid, is provided with a title attribute containing the data value of carid. The rest of the tags are simply column names enclosed in <> as starting tags and </> as ending tags. The tag content is put as the value in the column at the current row number.

After the entire XML tree has been created, it needs to be connected with the <collection> tag created earlier. Line 25 does the same as it joins entireop with the produced XML Document Tree and closes the collection tag. After this, the output file is opened in writing mode, and the XML Document Tree created in a String form is written to the file. The following output is produced after the program is run:

XML Output Produced After The Program Has Run

<collection oldcars="Car Details">
<carid title="C001">
<company>Hyundai</company>
<model>Aura</model>
<releaseyear>2019</releaseyear>
<licenseno>GJ23BB8384</licenseno>
<color>White</color>
<kmstravelled>28000</kmstravelled>
</carid>
<carid title="C002">
<company>Maruti</company>
<model>Eeco</model>
<releaseyear>2001</releaseyear>
<licenseno>GJ15AM7634</licenseno>
<color>White</color>
<kmstravelled>102013</kmstravelled>
</carid>
<carid title="C003">
<company>Hyundai</company>
<model>Alcazar</model>
<releaseyear>2021</releaseyear>
<licenseno>GJ01KJ9845</licenseno>
<color>Red</color>
<kmstravelled>3401</kmstravelled>
</carid>
<carid title="C004">
<company>Honda</company>
<model>City</model>
<releaseyear>2007</releaseyear>
<licenseno>GJ06RE4198</licenseno>
<color>Black</color>
<kmstravelled>45000</kmstravelled>
</carid>
<carid title="C005">
<company>Mahindra</company>
<model>Bolero</model>
<releaseyear>2006</releaseyear>
<licenseno>GJ17OY8714</licenseno>
<color>Black</color>
<kmstravelled>129090</kmstravelled>
</carid>
</collection>

The Final XML Output is Saved in example.xml file

This XML file can also be opened in Browser. It will produce an output like this:

Pandas Dataframe To XML Function()

Pandas also provides a function to the Dataframe which allows the user to directly convert the CSV file to an XML file. It provides with more XML-accurate structure and also the produced output can be easily parsed by the browser. The entire work can be done in two lines of code:

df=pd.read_csv('Car Details.csv')
df.to_xml('example.xml')

It produces more or less the same kind of output. The output can be seen in the image below:

Also Read: Pandas DataFrame.to_xml – Render a DataFrame to an XML Document

Entire Code for Reference

The defined function CSVtoXML() has been presented below for reference. It works on a time complexity of O(c*r) where c is the number of columns in the CSV file and r is the number of rows in the CSV file.

import pandas as pd
def CSVtoXML(inputfile,outputfile):
    if not inputfile.lower().endswith('.csv'):
        print('Expected A CSV File')
        return 0
    if not outputfile.lower().endswith('.xml'):
        print('Expected a XML file')
        return 0
    try:
        df=pd.read_csv(inputfile)
    except FileNotFoundError:
        print('CSV file not found')
        return 0
    entireop='<collection oldcars="Car Details">\n'
    att=df.columns
    rowop=''
    for j in range(len(df)):
        for i in range(len(att)):
            if i==0:
                rowop=rowop+f'<{att[i]} title="{df[att[i]][j]}">\n'
            elif i==len(att)-1:
                rowop=rowop+f'<{att[i]}>{df[att[i]][j]}</{att[i]}>\n</{att[0]}>\n'
            else:
                rowop=rowop+f'<{att[i]}>{df[att[i]][j]}</{att[i]}>\n'
    entireop=entireop+rowop+'</collection>'
    with open(outputfile,'w') as f:
        f.write(entireop)
CSVtoXML('Car Details.csv','example.xml')

Conclusion

XML is a Markup Language that produces data in a structured format for the user. It can also design and produce beautiful output on the data if style attributes are provided to the XML. CSV is a data storage format that uses delimiters to separate rows and column values from each other. It is comparatively easy to convert a CSV file to an XML file. Usually, the column names are considered tag names, and the values in the row are considered tag content. It uses simple string manipulation techniques to create the XML Document Root Tree and then save it to an XML file. Creating the function using nested for loops might lead to the function taking a lot of time. Using the to_xml() function will do the same work faster and in a more standard way.

References

1) Stackoverflow