How to Read a Dta File Into a Data Frame Using Pandas?

Stata is statistical software that handles two crucial components: data and statistics. It lets you store your data and perform many kinds of analysis on it. Stata is popular with data analysts because it covers data manipulation, visualization, and analysis in one place, without the need for a different tool for each task.

Stata is mainly used by researchers and academicians in the fields of political science, economics, and biomedicine to manage and visualize their data. Stata helps them to observe patterns in their data and draw conclusions. But anyone who is intrigued by data can use Stata to experiment with it.

Just as text files are stored with a txt extension, Excel sheets with xlsx, and Word documents with docx, Stata stores its data with a dta extension.

All the data you work with or generate using this software are saved with a dta extension.

Imagine you are working with dta files and want to use them in another language without recreating the same data in a format that language understands. Python makes this possible: the Pandas library has a dedicated function, read_stata, that reads a dta file into the library's basic structure, a data frame.

Pandas, one of the most widely used Python libraries for data analysis, can read several file formats, including Excel, Stata, and SPSS, which makes it easy to move data between tools.

In this article, we are going to look at the syntax and examples for importing the dta file as a data frame.

Before that, please check out this tutorial on the Pandas library

Understanding a Data Frame

A data frame, as simple as it sounds, is a storage container for data. It stores the data as a table of rows and columns, where each row is one entry. A data frame is the primary data structure used for data analysis, data science, and machine learning in Pandas.

There are many approaches to creating a data frame. We can obtain a data frame from a dictionary and even a list.

Data Frame From a List

import pandas as pd
fruits = [['apple','red',1],['banana','yellow',2],['tangerine','orange',3]]
df = pd.DataFrame(fruits, columns=['Fruits', 'Colors', 'Quantities'])
df

In the first line, we have imported the pandas library. The fruits variable contains a list of lists, one inner list per fruit, that we later convert into a data frame. The variable df holds the data frame returned by pd.DataFrame. We have also specified the column names through the columns argument. Since we did not specify an index, Pandas assigns the default integer index starting at 0.
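As mentioned earlier, a data frame can also be built from a dictionary. The following is a minimal sketch that reuses the fruit data; the variable name df_from_dict is chosen only for this illustration.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
fruits = {
    "Fruits": ["apple", "banana", "tangerine"],
    "Colors": ["red", "yellow", "orange"],
    "Quantities": [1, 2, 3],
}

df_from_dict = pd.DataFrame(fruits)
print(df_from_dict)
```

With a dictionary, the column names come from the keys, so the columns argument is not needed.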

How about we create data frames dynamically?

Now that we have understood what a data frame is, let us get to the main topic of discussion.

Exploring the Syntax of pd.read_stata

read_stata is a function that reads a Stata file (dta) and returns its contents as a data frame. Let us understand the syntax and arguments of the read_stata function.

pandas.read_stata(filepath_or_buffer, *, convert_dates=True, convert_categoricals=True, index_col=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, chunksize=None, iterator=False, compression='infer', storage_options=None)

Let us see the important arguments of this function.

filepath_or_buffer is the path to the dta file (or an open file-like object) that needs to be read into the data frame.

convert_dates decides if Stata date variables should be converted to the datetime values used by pandas.

convert_categoricals decides if variables with Stata value labels should be read as categorical columns of the data frame.

index_col is the name of the column that should be used as the index of the data frame.

preserve_dtypes decides if the Stata data types should be preserved in the data frame. If it is set to False, numeric data are upcast to the default pandas types (float64 or int64).

convert_missing: Stata has several kinds of missing values. By default, they are all read as NaN. If this argument is set to True, they are returned as StataMissingValue objects instead, and the affected columns get the object data type.

chunksize: Sometimes, we might encounter Stata files that are too large to load into memory at once. When chunksize is given, read_stata returns a reader object that yields the file in chunks of that many rows instead of a single data frame.
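To see a few of these arguments in action without the survey files, the sketch below writes a small, made-up data frame to a temporary dta file with DataFrame.to_stata and reads it back in three ways. The data, the file name sample.dta, and all variable names are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data; these are not values from the survey.
sample = pd.DataFrame({
    "birth_year": [1950, 1962, 1975, 1988, 1990, 2001],
    "igg": [1210.0, 1105.5, 1320.0, 990.0, 1015.0, 1150.5],
    "sex": pd.Categorical(["f", "m", "f", "m", "f", "m"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.dta")
    sample.to_stata(path, write_index=False)

    # columns: read only a subset of the variables.
    subset = pd.read_stata(path, columns=["birth_year", "igg"])

    # convert_categoricals=False: keep the stored integer codes
    # instead of turning the value labels into a categorical column.
    raw_codes = pd.read_stata(path, convert_categoricals=False)

    # chunksize: get a reader that yields data frames of at most 4 rows.
    with pd.read_stata(path, chunksize=4) as reader:
        chunk_lengths = [len(chunk) for chunk in reader]

print(subset.columns.tolist())
print(raw_codes["sex"].dtype)
print(chunk_lengths)
```

Reading in chunks keeps only one piece of the file in memory at a time, which is the reason to prefer it for very large files.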

Let us see a few examples of the syntax.

All the examples in the following sections use data from a survey conducted during the COVID-19 pandemic. The survey suggests that the hygienic measures taken during the pandemic may decrease the levels of an important antibody, immunoglobulin G (IgG), across all age groups.
Yamaguchi H, Hirata M, Hatakeyama K, Yamane I, Endo H, Okubo H, et al. (2022) Hygienic behaviors during the COVID-19 pandemic may decrease immunoglobulin G levels: Implications for Kawasaki disease. PLoS ONE 17(9): e0275295.

Reading a Dta File Into a Data Frame

Let us take a simple example to see how the function works. The dataset we use in this example concerns the oldest participants and their IgG levels. We are going to read the data into a data frame and use the birth year of the participants as its index.

Refer to this article for the reverse process.

import pandas as pd
dtafile = '/content/Oldest_IgG.dta'
df = pd.read_stata(dtafile,index_col='birth_year')
df.info()

In the first line, we imported the pandas library. The path to the dataset is stored in the variable dtafile. Next, we read this Stata file into a data frame with the help of the function read_stata. The index_col argument tells pandas to use the birth_year column as the index of the data frame. The data frame is stored in the variable called df. In the last line, df.info() prints a brief summary of the columns of the data frame.

df.head() returns the first five rows of the data frame.

As you can see from the output, the data frame has an index column which is the birth_year column of the original file.
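The effect of index_col can be checked on a small stand-in file. In the sketch below, only the column name birth_year is taken from the example above; the igg column, its values, and the file name people.dta are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for the survey file.
people = pd.DataFrame({
    "birth_year": [1931, 1945, 1958, 1967, 1979, 1984, 1996],
    "igg": [1250.0, 1190.5, 1302.0, 1010.0, 1125.5, 980.0, 1075.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "people.dta")
    people.to_stata(path, write_index=False)
    # birth_year is moved out of the columns and becomes the index.
    indexed = pd.read_stata(path, index_col="birth_year")

first_rows = indexed.head()  # first five rows by default
print(first_rows)
print(indexed.index.name)
```

Once birth_year is the index, rows can be selected by year with indexed.loc[...].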

Data Visualization using the Google Colab Suggest Chart Feature

Google Colaboratory is a popular notebook environment with many convenient features. One of its newer features looks at your data frame and automatically suggests different kinds of charts or graphs you can plot to analyze the data. Let us take a simple dta file, read it into a data frame, and visualize it.

dta2='/content/Neonate_with_IgG.dta'
df = pd.read_stata(dta2)
df

This dataset provides information about the IgG levels of neonates (newborn babies). The path to the dataset is stored in dta2, and the file is read into a data frame called df. This data frame is displayed in the last line.

The above image shows the data frame and the Suggest Chart feature beside it. We are going to use the Suggest Chart feature to visualize the data we have. Even better, you don't need to write the code to visualize the data! Click on the chart or graph you want to generate, and Colab gives you the code for it.

Various visualizations provided by Colab

Let us generate a time series plot for the data frame. The code below is the code Colab produces for the chosen chart.

import numpy as np
from google.colab import autoviz
df_2742181028172811714 = autoviz.get_df('df_2742181028172811714')

def time_series_multiline(df, timelike_colname, value_colname, series_colname, figsize=(2.5, 1.3), mpl_palette_name='Dark2'):
  from matplotlib import pyplot as plt
  import seaborn as sns
  palette = list(sns.palettes.mpl_palette(mpl_palette_name))
  def _plot_series(series, series_name, series_index=0):
    if value_colname == 'count()':
      counted = (series[timelike_colname]
                 .value_counts()
                 .reset_index(name='counts')
                 .rename({'index': timelike_colname}, axis=1)
                 .sort_values(timelike_colname, ascending=True))
      xs = counted[timelike_colname]
      ys = counted['counts']
    else:
      xs = series[timelike_colname]
      ys = series[value_colname]
    plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

  fig, ax = plt.subplots(figsize=figsize, layout='constrained')
  df = df.sort_values(timelike_colname, ascending=True)
  if series_colname:
    for i, (series_name, series) in enumerate(df.groupby(series_colname)):
      _plot_series(series, series_name, i)
    fig.legend(title=series_colname, bbox_to_anchor=(1, 1), loc='upper left')
  else:
    _plot_series(df, '')
  sns.despine(fig=fig, ax=ax)
  plt.xlabel(timelike_colname)
  plt.ylabel(value_colname)
  return autoviz.MplChart.from_current_mpl_state()

chart = time_series_multiline(df_2742181028172811714, *['period_6', 'igm', None], **{})
chart

The output is given below.
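The generated code relies on google.colab.autoviz, which exists only inside Colab. As a rough sketch of how a similar line chart could be drawn elsewhere, the block below uses plain matplotlib. The column names period_6 and igm come from the generated code; the values and the variable names are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values; only the column names come from the Colab code.
neonates = pd.DataFrame({
    "period_6": [3, 1, 2, 5, 4, 6],
    "igm": [11.0, 8.5, 9.2, 12.4, 10.1, 13.0],
})

# Sort by the time-like column first, as the generated code does.
ordered = neonates.sort_values("period_6")

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(ordered["period_6"], ordered["igm"])
ax.set_xlabel("period_6")
ax.set_ylabel("igm")
```

Calling fig.savefig with a file name of your choice writes the chart to disk; in a notebook, the figure is displayed automatically.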

Conclusion

We have seen what the Stata software is and how it is useful for researchers and data analysts. We have learned what a data frame is and looked at a basic example of it.

We have taken a real-world data set from a survey according to which the hygienic measures people took during COVID-19 may decrease IgG antibody levels.

In the first case, we read a dta file into a data frame and used the participants' year of birth as its index.

In the second case, we used the Suggest Chart feature of Google Colaboratory to plot the data.