How to read .data files in Python?

Working With Data Files In Python

While working with data entry and data collection for training models, we come across .data files.

This is a file extension used by a handful of applications to store data; one example is Analysis Studio, a tool specializing in statistical analysis and data mining.

Working with the .data file extension is pretty simple: it mostly comes down to identifying how the data is stored, and then using the appropriate Python commands to access the file.

What is a .data file?

.data files were developed as a means to store data.

A lot of the time, data in this format is stored as either comma-separated values (CSV) or tab-separated values (TSV).

On top of that variation, the file may be plain text or binary, and in the latter case we will need to access it differently.

We will be working with CSV-style .data files in this article, but let us first identify whether the content of the file is text or binary.

Identifying data inside .data files

.data files come in two variations: the file itself is either text or binary.

To find out which one we are dealing with, we'll need to load the file and test it ourselves.

Let’s get started!

1. Testing: Text file

.data files most often exist as text files, and accessing files in Python is pretty simple.

Since file handling is built into Python itself, we don't need to import any module to work with files.

That being said, the way to open, read, and write to a file in Python is as follows:

# reading from the file
file = open("biscuits.data", "r")
contents = file.read()
print(contents)
file.close()

# writing to the file
file = open("biscuits.data", "w")
file.write("Chocolate Chip")
file.close()
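As a small aside, the same read can also be written with a context manager, a minimal sketch reusing the biscuits.data name from above:

# the with statement closes the file for us automatically,
# even if an error occurs midway
with open("biscuits.data", "r") as file:
    contents = file.read()
print(contents)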

2. Testing: Binary File

.data files can also be binary, which means the way we access the file needs to change as well.

We will be reading from and writing to the file in binary mode; for reading, the mode is rb, or read binary, and for writing it is wb, write binary.

# reading from the file
file = open("biscuits.data", "rb")
contents = file.read()
print(contents)
file.close()

# writing to the file
file = open("biscuits.data", "wb")
file.write(b"Oreos")  # binary mode expects bytes, not str
file.close()

File operations in Python are relatively easy to understand, and are worth exploring if you want to see the different file access modes and how to use them.

Either of these approaches should work and give you a way to inspect the contents stored inside the .data file.
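If you'd rather not guess, here is a small sketch (not part of the original examples) that tries to decode the file as UTF-8 text and falls back to treating it as binary:

def sniff_data_file(path):
    # read the raw bytes once, then try to interpret them as UTF-8 text
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return "text", raw.decode("utf-8")
    except UnicodeDecodeError:
        return "binary", raw

kind, content = sniff_data_file("biscuits.data")
print(kind)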

Now that we know which format the file is in, we can use pandas to create a DataFrame from the CSV content.

3. Using Pandas to read .data files

Once we've checked what kind of content the file holds, the simplest way to extract it is the read_csv() function provided by pandas.

import pandas as pd
# reading csv files
data = pd.read_csv('file.data', sep=",")
print(data)

# reading tsv files
data = pd.read_csv('otherfile.data', sep="\t")
print(data)

This method also loads the data into a DataFrame automatically.
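Many .data files ship without a header row; in that case you can pass header=None and, optionally, supply your own column names. The names below are purely hypothetical:

import pandas as pd

# no header row in the file, so we name the columns ourselves
data = pd.read_csv('file.data', sep=",", header=None,
                   names=["id", "value", "label"])
print(data.head())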

Below is a sample CSV file, renamed to a .data file and accessed using the same code as given above.

   Series reference                                        Description   Period  Previously published  Revised
0    PPIQ.SQU900000                 PPI output index - All industries   2020.06                  1183     1184
1    PPIQ.SQU900001         PPI output index - All industries excl OOD  2020.06                  1180     1181
2    PPIQ.SQUC76745  PPI published output commodity - Transport sup...  2020.06                  1400     1603
3    PPIQ.SQUCC3100  PPI output index level 3 - Wood product manufa...  2020.06                  1169     1170
4    PPIQ.SQUCC3110  PPI output index level 4 - Wood product manufa...  2020.06                  1169     1170
..              ...                                                ...      ...                   ...      ...
73   PPIQ.SQNMN2100  PPI input index level 3 - Administrative and s...  2020.06                  1194     1195
74   PPIQ.SQNRS211X     PPI input index level 4 - Repair & maintenance  2020.06                  1126     1127
75       FPIQ.SEC14  Farm expenses price index - Dairy farms - Freight  2020.06                  1102     1120
76       FPIQ.SEC99  Farm expenses price index - Dairy farms - All ...  2020.06                  1067     1068
77       FPIQ.SEH14    Farm expenses price index - All farms - Freight  2020.06                  1102     1110

[78 rows x 5 columns]

As you can see, it has indeed given us a DataFrame as an output.

What are the other types of formats to store data?

Sometimes, the default method to store data just doesn’t cut it. So, what are the alternatives to working with file storage?

1. JSON Files

As a way to store information, JSON is a wonderful format to work with, and Python's built-in json module makes the integration feel nearly seamless.

However, in order to work with it in Python, you’ll need to import the json module in the script.

import json

Now, after constructing a JSON-compatible structure, storing it is a simple file operation using json.dump.

# dumping the structure as a JSON object into the file
with open("file.json", "w") as f:
    json.dump(['foo', {'bar': ('baz', None, 1.0, 2)}], f)

# you can also sort the keys and pretty-print the output using this module
with open("file.json", "w") as f:
    json.dump(['foo', {'bar': ('baz', None, 1.0, 2)}], f, indent=4, sort_keys=True)

Note that we are dumping into the file using the variable f.
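If you only want the encoded string rather than writing to a file, json.dumps (with an s) returns it directly:

# dumps returns the JSON-encoded string instead of writing to a file
encoded = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}], indent=4, sort_keys=True)
print(encoded)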

The equivalent function to retrieve information from a JSON file is called load.

with open('file.json') as f:
    data = json.load(f)

This provides us with the structure and information of the JSON object inside the file.
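One detail worth noting: JSON has no tuple type and stores None as null, so the tuple we dumped above comes back as a list when loaded:

print(data)
# ['foo', {'bar': ['baz', None, 1.0, 2]}]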

2. Pickle

Normally, when you store information as plain text, the object is reduced to a raw string and loses its properties, so we'd need to reconstruct the object from that string in Python.
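To make the problem concrete, here is a quick illustration (not from the original article, using a hypothetical prices.txt file) of what happens when we store a list as plain text:

prices = [40, 60, 30]

# writing the list as plain text
with open("prices.txt", "w") as f:
    f.write(str(prices))

# reading it back gives us a string, not a list
with open("prices.txt") as f:
    loaded = f.read()

print(type(loaded))  # <class 'str'>
print(loaded[0])     # '[' -- indexing now works on characters, not items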

The pickle module combats this issue: it was made for serializing and de-serializing Python object structures so they can be stored in a file.

This means you can store a list through pickle, and when it's loaded back by the pickle module later, you won't lose any of the properties of the list object.

To use it, we'll need to import the pickle module; there's no need to install it, as it's part of the standard Python library.

import pickle

Let us create a dictionary to use with the file operations we've covered so far.

apple = {"name": "Apple", "price": 40}
banana = {"name": "Banana", "price": 60}
orange = {"name": "Orange", "price": 30}

fruitShop = {}
fruitShop["apple"] = apple
fruitShop["banana"] = banana
fruitShop["orange"] = orange

Working with the pickle module is just about as simple as working with JSON.

file = open('fruitPickles', 'ab') 
# the 'ab' mode allows for us to append to the file  
# in a binary format

# the dump method appends the serialized object
# to the end of the file
pickle.dump(fruitShop, file)
file.close()

file = open('fruitPickles', 'rb')
# now, we can read the object back through the load function.
fruitShop = pickle.load(file)
file.close()
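As a quick sanity check (a small sketch, not part of the original example), the loaded object behaves exactly like the dictionary we started with:

# the nested dictionaries survive the round trip intact
print(type(fruitShop))               # <class 'dict'>
print(fruitShop["apple"]["price"])   # 40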

Conclusion

You now know what .data files are and how to work with them. You also know a few alternative options for storing and retrieving data.

Look into our other articles for an in-depth tutorial on each of these modules – File Handling, Pickle, and JSON.
