How to Read SAS Files Using Pandas?


Fields like data science and machine learning depend on robust data infrastructure, which is why tools such as SAS Institute's Statistical Analysis System (SAS) and Python have become so popular. SAS specializes in data management, predictive modeling, and business intelligence. It is used extensively in clinical analysis, research, criminal investigation, and other data-rich domains.

Similarly, Python, with its vast collection of data science libraries, has become a flexible platform for analytics and visualization. Alongside alternatives like StataCorp's Stata software, which is used for specialized medical and economic analysis, SAS and Python help make sense of the growing volume of data across industries. As data keeps growing in volume and importance, capable data wrangling and analysis frameworks will remain essential.


Understanding SAS File Formats

SAS stores data in two file formats: sas7bdat and XPORT (xpt).

sas7bdat: This is the standard format introduced by SAS to store data. It is a binary-encoded format used for predictive analytics, business intelligence, and reporting. A file in this format contains the database ID, metadata, and the contents of the data. It can be read using SAS Studio and other libraries or packages that are compatible with the software.

One such third-party Python package is sas7bdat, available on PyPI, which reads these files from Python without requiring SAS itself.

XPORT: Just as the name suggests, this file format is used to transport the data between various platforms, operating systems, and versions of the software. These files have a .xpt extension.

Introduction to Statistical Analysis System (SAS)

Statistical Analysis System (SAS) is used by many researchers because it can store data collected from multiple sources in one place. SAS is written in C and can be used for data management, analytics, and business intelligence.

SAS can serve as a software suite, a programming language, and a data management tool. It is platform-independent, runs on most major operating systems, and interoperates with high-level programming languages like Python and R.

With its statement-like code syntax, it is easy to learn and implement. One of the main applications of SAS is reporting. With SAS, we can save and download the report document in the form of a PDF, RTF, PowerPoint, and so on.


Next, we will see how to read both of these formats into a data frame using the Pandas library.

Reading SAS Formats as a Dataframe

The Pandas library provides a dedicated reader function for most common storage formats, which makes it easier to move data across languages and platforms. It has a dedicated method for reading SAS data too. Let us take a look at the syntax of the method and its important parameters.

pandas.read_sas(filepath_or_buffer, *, format=None, index=None, encoding=None, chunksize=None, iterator=False, compression='infer')
  • filepath_or_buffer: This parameter is used to fetch the path or the location of the SAS file. It accepts a string or a path-like object as input
  • format: This parameter accepts a string, either 'xport' or 'sas7bdat', that specifies the file format. It is optional: if left as None, the method infers the format from the file extension
  • index: This optional parameter names the column to use as the index of the data frame. The default is None
  • encoding: The encoding used to decode text data in the file. If left as None, text columns are returned as raw bytes
  • compression: This parameter controls on-the-fly decompression when the file on disk is compressed. The default, 'infer', detects the compression from the file extension
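
The signature above also lists chunksize and iterator, which are not covered in the parameter list. When chunksize is set, read_sas returns a reader that yields data frames of at most that many rows instead of one large data frame, which can help with files that do not fit comfortably in memory. The following is a minimal sketch: the helper name count_rows_in_chunks is hypothetical, and using the reader in a with block assumes a reasonably recent pandas release.

```python
import pandas as pd


def count_rows_in_chunks(path, chunksize=10_000):
    """Count the rows of a SAS file without loading it all at once.

    `count_rows_in_chunks` is a hypothetical helper, not part of pandas.
    """
    total = 0
    # With chunksize set, read_sas returns a reader object rather than a DataFrame.
    with pd.read_sas(path, chunksize=chunksize) as reader:
        for chunk in reader:  # each chunk is a DataFrame of up to `chunksize` rows
            total += len(chunk)
    return total


# Possible usage, reusing the file path from this article:
# print(count_rows_in_chunks('/content/Demog.sas7bdat', chunksize=500))
```

The same loop body could filter or aggregate each chunk instead of only counting rows.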

The datasets used in both examples contain demographic data: each person's ID, country name, gender, and other personal details.

Reading sas7bdat File Format

Let us take a look at how to read a SAS file into a data frame using the above-discussed method.

import pandas as pd

# Read the sas7bdat file into a data frame
df = pd.read_sas('/content/Demog.sas7bdat')
print(df)

In this code, we have imported the pandas library in the first line. Next, we have created a data frame named df to store the dataset read by using the method. The data frame is then printed in the last line.

Reading a sas7bdat file into a data frame
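
When encoding is left as None, text columns come back as raw bytes (values such as b'...'). Passing encoding to read_sas is usually the simplest fix. For a data frame that has already been loaded, the sketch below decodes the byte values afterwards. The helper name decode_bytes_columns and the default of UTF-8 are assumptions for illustration; the right encoding depends on how the file was written.

```python
import pandas as pd


def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df with bytes values decoded to str.

    `decode_bytes_columns` is a hypothetical helper, and the default
    encoding is an assumption; use the encoding the file was written with.
    """
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object:
            # Decode only bytes values; leave everything else (e.g. NaN) untouched.
            out[col] = out[col].map(
                lambda v: v.decode(encoding) if isinstance(v, bytes) else v
            )
    return out


# Small in-memory example that mimics what read_sas returns without encoding
raw = pd.DataFrame({"name": [b"Ann", b"Bob"], "age": [30.0, 41.0]})
clean = decode_bytes_columns(raw)
print(clean)
```

Numeric columns pass through unchanged, so the helper can be applied to the whole data frame.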

Reading an xpt File Format

We can read an xpt file in the same manner as discussed above.

dfx = pd.read_sas('/content/dm.xpt')
print(dfx)
Reading an xpt file

We can also specify the format of the file explicitly as shown below.

fp = '/content/dm.xpt'
dfx1 = pd.read_sas(fp, format='xport')
print(dfx1)

This code results in the same output as the above code.
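
Passing format explicitly matters most when a file name does not end in a recognizable extension, because the automatic inference relies on the file path. As a rough sketch, the extension-to-format mapping can be made explicit so that an unrecognized file fails early with a clear message. The names guess_sas_format and read_sas_explicit are hypothetical helpers, not part of pandas.

```python
from pathlib import Path

import pandas as pd

# Extensions mapped to the two format strings that read_sas accepts
_FORMAT_BY_SUFFIX = {".sas7bdat": "sas7bdat", ".xpt": "xport"}


def guess_sas_format(path):
    """Return 'sas7bdat' or 'xport' based on the file extension (hypothetical helper)."""
    suffix = Path(path).suffix.lower()
    if suffix not in _FORMAT_BY_SUFFIX:
        raise ValueError(
            f"Cannot infer SAS format from {str(path)!r}; pass format= explicitly"
        )
    return _FORMAT_BY_SUFFIX[suffix]


def read_sas_explicit(path, **kwargs):
    """Read a SAS file, always passing an explicit format to read_sas."""
    return pd.read_sas(path, format=guess_sas_format(path), **kwargs)


print(guess_sas_format('/content/dm.xpt'))
print(guess_sas_format('/content/Demog.sas7bdat'))
```

With this in place, read_sas_explicit('/content/dm.xpt') would behave like the explicit call shown above.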

Summary

To recap, we introduced SAS and the features that make it a popular platform for data-related use cases. Then we discussed the two file formats available to save and export the data. Lastly, we saw how to read these formats into a data frame using the Pandas library. How will integrating SAS data with Python reshape the future of data analysis?

Datasets Used

References

Pandas Documentation