How to Export Dataframe to Stata Dta Format?


Stata is a statistical software package used for data science-related tasks. It is a complete, integrated package that covers data manipulation, visualization, statistics, and automated reporting. The name combines the words statistics and data.

Stata is mainly used by researchers and academicians in the fields of political science, economics, and biomedicine to manage and visualize their data. Stata helps them to observe patterns in their data and draw conclusions. But anyone who is intrigued by data can use Stata to experiment with it.

Stata not only helps with the data-related tasks but also with your reporting tasks. You can also manage all your reports with Stata’s automated reporting facility.

How can Stata be useful to developers or data scientists? Data is everywhere today. Whether it is everyday data, big data (which keeps growing), or dark data that is still largely unexplored, we need a tool to store, manipulate, and visualize it.

You may be working on a large project with many data files that are messy and hard to handle. Stata can help you manage data sets from small to very large, and it also lets you manipulate that data.

As we know by now, data can take many forms. It can be an Excel file, a Word document, a table, or even a data frame.

This brings us to the main question: can we store a data frame in the Stata format?

Pandas is a popular library for working with data. It has a number of methods for converting between data formats. We can write a data frame to many formats, such as Excel, Parquet, and Feather, and read those formats back into a data frame. In the same way, we can store a data frame in the Stata format with the help of the Pandas library.


Let us first introduce the data frame and then move on to exporting it.

What Is a Data Frame?

A data frame is one of the most widely used structures for storing data. Just like a table, a data frame stores data in rows and columns. It can hold heterogeneous data, which means that different columns can contain different data types. For example, the column names are usually strings, while the values inside the columns can be numerical.

The pd.DataFrame constructor returns a data frame from data structures like lists, dictionaries, and lists of dictionaries. A data frame can also be created from files in Excel format, CSV format, and so on.


Let us work on a few examples of creating a data frame.

Creating a Data Frame From a List

We can create a data frame from a list. But first, we need to define the list of elements.

import pandas as pd
ls=['Mangoes','Apples','Oranges','Tomatoes','Potatoes']
df=pd.DataFrame(ls)
print(df)

pd.DataFrame is part of the Pandas library, so in order to use it, we first need to import the library.

Next, we created a variable ls to store the elements in a list format.

The variable df is used to store the data frame. Lastly, we are printing the data frame.

Data Frame From A List

Creating a Data Frame From a Dictionary

Just like the above example, we can create a data frame from a dictionary.

import pandas as pd
dct={'Groceries':['Mangoes','Apples','Oranges','Tomatoes','Potatoes'],
'cost':[80,100,70,60,30],
'No of units':[2,3,4,1,1]}
df=pd.DataFrame(dct)
print(df)

Here, we have used a dictionary of three lists: the groceries, their costs, and the number of units bought. All these key-value pairs are stored in a variable called dct. This dictionary is then converted into a data frame called df.

Data Frame From A Dictionary
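
The list-of-dictionaries case mentioned earlier works in much the same way. The sketch below is only an illustration: the records and the names records and df_records are made up for this example. Each dictionary becomes one row, and the dictionary keys become the column names.

```python
import pandas as pd

# Hypothetical records: each dictionary becomes one row of the data frame.
records = [
    {"Groceries": "Mangoes", "cost": 80},
    {"Groceries": "Apples", "cost": 100},
    {"Groceries": "Oranges", "cost": 70},
]

# The dictionary keys are used as the column names.
df_records = pd.DataFrame(records)
print(df_records)
```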

Data Frame to Stata Method Explained

The DataFrame.to_stata method writes a data frame to a Stata file. The Stata file should be saved with the .dta extension.

The syntax of this method is given below.

DataFrame.to_stata(path, *, convert_dates=None, write_index=True, byteorder=None, time_stamp=None, data_label=None, variable_labels=None, version=114, convert_strl=None, compression='infer', storage_options=None, value_labels=None)
Name of the Argument: Description

  • path: The file path (a string or path object) or a writable buffer to which the Stata file is written.
  • convert_dates: A dictionary that maps datetime columns to the Stata internal date format used when writing them. The format can be any of 'tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'.
      tc: date and time, stored as milliseconds since 1 January 1960
      td: daily date, stored as days since 1 January 1960
      tw: weekly date
      tm: monthly date
      tq: quarterly date
      th: half-yearly date
      ty: yearly date
    Datetime columns without a specified format are written as 'tc'. If a datetime column has time zone information, NotImplementedError is raised.
  • write_index: Whether to write the data frame index to the Stata file. By default, it is True, which means the index is included.
  • byteorder: The byte order in which the binary Stata file is written. The default is None, in which case the byte order of the current system is used. It can also be '>', '<', 'little', or 'big'.
  • time_stamp: A datetime to use as the creation time of the file. By default, it is the current time.
  • data_label: A string that provides a label for the data set. It must be no longer than 80 characters.
  • variable_labels: A dictionary with column names as keys and variable labels as values. Each label must be no longer than 80 characters.
  • version: The version of the dta format to use when writing the file. It can be one of {114, 117, 118, 119, None}. The default is 114, which is compatible with Stata versions 10 and later. When set to None, the Pandas library chooses between 118 and 119.
  • convert_strl: A list of column names to convert to the Stata StrL string format. It is not supported with the default version 114; the Pandas documentation lists it for version 117.
  • compression: How the output is compressed on the fly. It can be 'gzip', 'bz2', 'xz', 'zstd', or 'infer'. The default is 'infer', which picks the compression from the file extension.
  • storage_options: Extra options for a particular storage connection, such as a URL-based file system. Examples are host, port, username, and password.
  • value_labels: A dictionary with column names as keys and, for each column, a dictionary that maps column values to labels. The labels for a single variable must be no longer than 32000 characters in total.
Arguments of Data Frame to Stata
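
To make a few of these arguments concrete, here is a small sketch that writes a data frame without its index and attaches a data label and variable labels. The data, the label text, and the temporary output path are assumptions made for illustration. The file is then opened with pd.read_stata(..., iterator=True), which returns a reader object, to check that the labels were stored.

```python
import os
import tempfile

import pandas as pd

# Illustrative data; the numbers are not from a real data set.
df = pd.DataFrame({"cost": [80, 100, 70], "units": [2, 3, 4]})

# Write to a temporary folder so the sketch does not clutter the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "labelled.dta")

df.to_stata(
    out_path,
    write_index=False,  # leave the data frame index out of the file
    data_label="Grocery purchases",  # at most 80 characters
    variable_labels={"cost": "Cost per unit", "units": "Number of units bought"},
)

# Read the metadata back to confirm what was stored.
with pd.read_stata(out_path, iterator=True) as reader:
    stored_data_label = reader.data_label
    stored_variable_labels = reader.variable_labels()

print(stored_data_label)
print(stored_variable_labels)
```

The labels do not change the data itself; they are descriptive metadata that Stata shows next to the data set and its variables.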

There are a few errors this method can raise in some cases:

NotImplementedError

  • If a datetime column contains time zone information
  • If the data type of a column is not representable in Stata

ValueError

  • If the columns listed in convert_dates are neither datetime64 nor datetime.datetime
  • If a column listed in convert_dates is not in the data frame
  • If a categorical label contains more than 32000 characters
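
Two of these cases can be reproduced with a short sketch. The data frames, the column names (when, cost, missing_col), and the temporary path are hypothetical, and the exact error messages may differ between Pandas versions, so the code only records the type of the exception.

```python
import os
import tempfile

import pandas as pd

out_path = os.path.join(tempfile.mkdtemp(), "errors_demo.dta")

# Case 1: a datetime column that carries time zone information.
df_tz = pd.DataFrame(
    {"when": pd.to_datetime(["2023-01-01", "2023-02-02"]).tz_localize("UTC")}
)
tz_error = None
try:
    df_tz.to_stata(out_path)
except NotImplementedError as exc:
    tz_error = exc

# Case 2: convert_dates names a column that is not in the data frame.
df_plain = pd.DataFrame({"cost": [80, 100, 70]})
missing_error = None
try:
    df_plain.to_stata(out_path, convert_dates={"missing_col": "td"})
except ValueError as exc:
    missing_error = exc

print(type(tz_error).__name__)
print(type(missing_error).__name__)
```

For the first case, one possible workaround is to drop the time zone with tz_localize(None) before exporting, if losing that information is acceptable for your data.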

Exporting a Data Frame to Stata

Let us try to export a data frame to the Stata format with some examples.

Exporting a Data Frame (From CSV) To Stata

In this example, we are going to take a CSV data set, read it as a data frame, and then export it to the Stata format.

The CSV dataset we take for this example is from an IPL data set. It has attributes like the seasonId, the year in which the season took place, who won the man of the match in that year, and so on.

import pandas as pd
df=pd.read_csv('Season.csv')
df

First, we imported the Pandas library to be able to create a data frame. Next, a variable called df is created to store the data frame read from the CSV file, Season.csv.

In the last line, we display the data frame.

Season Dataframe

Now that we have the data frame, let us try to convert it to Stata format.

df.to_stata('Season.dta')

In the above line of code, we called the method df.to_stata, which takes the file name as its argument. Here you specify the name of the file you want the output written to, with the dta extension.

When this code is executed, you can see a file called Season.dta being created in your environment.

Data Frame To Stata

After you get hold of the data file, you can now use it to manipulate and visualize the data using the Stata software.

Exporting a Data Frame to Stata by Specifying the Path

We are going to follow the same steps: create a data frame and call the method, but this time pass the file name through the path keyword argument.

import pandas as pd
dct={'Groceries':['Mangoes','Apples','Oranges','Tomatoes','Potatoes'],
'cost':[80,100,70,60,30],
'No_of_units':[2,3,4,1,1]}
df=pd.DataFrame(dct)
df

To explain the code briefly, we have initialized a variable called dct to store a dictionary of grocery items (Mangoes, Apples, Oranges, Tomatoes, and Potatoes), their respective costs, and the number of units bought.

This dictionary is then passed to pd.DataFrame to create a data frame. This data frame is stored in a variable called df.

The data frame is displayed in the next line.

Grocery Dataframe

The next step is to export this data frame into a dta file.

df.to_stata(path='Groceries.dta')

The path is supplied as a keyword argument to the method df.to_stata. Its value is the name of the file to which we want the output to be written.

Data Frame To Stata With Path

We can even preview the dta file with the help of another method, pd.read_stata.

st=pd.read_stata('Groceries.dta')
st

A new variable called st is created to hold the contents of the Stata file. The newly created Groceries.dta is passed as input to the read method.

In the next line, we display the data read from the Stata file.

Reading The Stata File
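
One detail to watch when previewing the file: because write_index defaults to True, the data frame index is written as an extra column, which shows up when the file is read back. The sketch below uses a cut-down, numeric version of the grocery data and temporary file names chosen for illustration, and compares the columns with and without the index.

```python
import os
import tempfile

import pandas as pd

# A cut-down, numeric version of the grocery data used above.
df = pd.DataFrame({"cost": [80, 100, 70], "No_of_units": [2, 3, 4]})

tmp_dir = tempfile.mkdtemp()
with_index_path = os.path.join(tmp_dir, "with_index.dta")
without_index_path = os.path.join(tmp_dir, "without_index.dta")

# Default behaviour: the index is written as a column named "index".
df.to_stata(with_index_path)

# write_index=False keeps only the data columns.
df.to_stata(without_index_path, write_index=False)

cols_with_index = list(pd.read_stata(with_index_path).columns)
cols_without_index = list(pd.read_stata(without_index_path).columns)

print(cols_with_index)
print(cols_without_index)
```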

Exporting a Date Dataframe to Stata

In this example, we are going to take a data frame of different dates and export it to a dta file using the convert_dates argument.


import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['2023-01-01', '2023-02-02', '2023-03-03'])})
df

In this code, we create a data frame with a few dates. The convert_dates argument only works on datetime columns, so we used pd.to_datetime to convert the date strings to the datetime type.

Dataframe Of Dates

In the following step, we convert the datetime column to an internal date format of Stata using the convert_dates argument.

df.to_stata('dates.dta', convert_dates={'date': 'tc'},data_label='Dates')

In the above code, we used the convert_dates argument and specified the format tc for the date column, which means the values are stored in Stata's date-and-time format, counted in milliseconds since 1 January 1960. We also gave the data set the label Dates.

This change in the datetime can be noticed in the data file when you view it with the help of Stata software.

Data Frame To Stata Using Convert Dates
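
If Stata is not at hand, the effect of convert_dates can also be checked from Pandas. The following sketch is an illustration under a few assumptions: it uses the td format instead of tc, writes to a temporary file, and reads the file twice, once with the default date conversion and once with convert_dates=False to see the raw numbers stored in the file.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"date": pd.to_datetime(["2023-01-01", "2023-02-02", "2023-03-03"])}
)

out_path = os.path.join(tempfile.mkdtemp(), "dates_td.dta")

# 'td' stores each date as the number of days since 1 January 1960.
df.to_stata(out_path, convert_dates={"date": "td"}, write_index=False)

# Default read: the stored numbers are converted back to datetimes.
restored = pd.read_stata(out_path)
restored_dates = restored["date"].dt.strftime("%Y-%m-%d").tolist()

# Raw read: keep the numbers exactly as they are stored in the file.
raw_days = pd.read_stata(out_path, convert_dates=False)["date"].tolist()

# The same day counts, computed directly in Pandas for comparison.
stata_epoch = pd.Timestamp("1960-01-01")
expected_days = [(d - stata_epoch).days for d in df["date"]]

print(restored_dates)
print(raw_days)
print(expected_days)
```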

Conclusion

To conclude, we have seen how the Stata software has been helping researchers and academicians to store, manipulate and visualize their data and also providing assistance in reporting tasks.

We have also observed that the Pandas library has a method called DataFrame.to_stata, which is used to export a data frame to the Stata format. The Stata file must be saved with a dta extension.

We covered the basics of a data frame and created data frames from a list and from a dictionary.

Next, we explored the syntax of the method and went through each of its arguments.

Coming to the examples, we first took a CSV data set, read it into a data frame, and then wrote a dta file from that data frame.

Next, we created a data frame from a dictionary and specified the file to which we want the output to be written using the path argument.

Stata stores dates and times in its own internal formats, so it does not accept just any date or time. In the third example, we created a data frame with dates using the pd.to_datetime method and then used it to write a Stata file with dates in a supported format with the help of convert_dates.

References

The method used in this post is available in the official Pandas documentation.

You can visit the official website of Stata to experiment with it.

You can download the CSV data used in this post from here.