NumPy Python: Calculating Auto-Covariance

NumPy is a go-to tool for statistical computing in Python, and auto-covariance is a core statistical concept for time series. In this article, we will see how to calculate auto-covariance using NumPy.

Definition of Auto-Covariance

Auto-covariance is the covariance between a time series and a lagged copy of itself. It measures how strongly values separated by a given lag vary together.

Auto-covariance is a vital tool for understanding periodic behavior in time series data. It helps us identify characteristics such as trend and seasonality.

Introduction to NumPy

NumPy is a powerful Python library that has gained popularity for its rich feature set and its ability to handle large numerical computations with ease.

It is easy to pick up and comes loaded with functions that help us perform complex operations with very little code.

To take a deeper dive into the NumPy library, check out the linked article.

Applications of auto-covariance in data analysis and time series forecasting

Auto-covariance is an important tool for many tasks in data analysis and time series forecasting.

It helps in understanding seasonal trends by measuring how strongly a pattern repeats at particular lags.

It can also be useful when building Machine Learning models on time series, for example when tuning lag-related hyperparameters, which can improve the performance of the model and the quality of its results.

Overall, auto-covariance is a vital tool that we can apply to a wide range of problems across many domains.
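To make the point about seasonality concrete, here is a minimal sketch. It assumes a hypothetical, perfectly periodic series with a 12-step cycle, and a small helper named autocov (not a NumPy function) that divides by n. For such a series, the auto-covariance is positive at a lag of one full period and negative at half a period.

```python
import numpy as np

def autocov(x, k):
    """Biased sample auto-covariance of x at lag k (divides by n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dev = x - x.mean()
    # Multiply each value's deviation with the deviation k steps later.
    return np.sum(dev[: n - k] * dev[k:]) / n

# Hypothetical "monthly" series: a pure 12-step cycle repeated 10 times.
t = np.arange(120)
series = np.sin(2 * np.pi * t / 12)

acov_half = autocov(series, 6)    # half a period apart: values oppose each other
acov_full = autocov(series, 12)   # one full period apart: values line up again

print(f"lag 6:  {acov_half:.3f}")
print(f"lag 12: {acov_full:.3f}")
```

Real data is noisier than this sine wave, so the peaks are usually less clean, but the same reading applies: a large positive value at lag k suggests a pattern that repeats every k steps.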

Understanding Covariance

Definition of Covariance

Covariance is a statistical tool that measures the extent to which two variables change together. In data analysis, covariance shows the relationship between two collections of data points and tells if they might increase or decrease together.

If one variable increases when the other does, the covariance is positive. Likewise, if one variable decreases when the other increases, the covariance is negative.

If the two variables show no linear relationship, the covariance is zero.
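The three cases can be checked quickly with ‘np.cov()’. The arrays below are made-up illustration data, and the variable names are our own. Note that the zero-covariance case uses a V-shaped series that clearly depends on x, which shows that zero covariance only rules out a linear relationship.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

rising = 2 * x + 1                               # increases as x increases
falling = 10 - 3 * x                             # decreases as x increases
v_shaped = np.array([2.0, 1.0, 0.0, 1.0, 2.0])   # symmetric around the middle of x

# np.cov returns a 2x2 matrix; entry [0, 1] is the covariance of the pair.
cov_pos = np.cov(x, rising)[0, 1]
cov_neg = np.cov(x, falling)[0, 1]
cov_zero = np.cov(x, v_shaped)[0, 1]

print("positive:", cov_pos)
print("negative:", cov_neg)
print("zero:    ", cov_zero)
```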

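Covariance can be expressed as a mathematical equation. In its sample form (the division by n - 1 is assumed here because it matches the default behavior of ‘numpy.cov()’), it reads:

$$\operatorname{cov}(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

where

$\bar{x}$ = mean of X

$\bar{y}$ = mean of Y

$x_i$ = individual data points of X

$y_i$ = individual data points of Y

$n$ = number of data points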
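As a rough check of the definition, the sketch below computes the sample covariance by hand and compares it with ‘np.cov()’. The helper name sample_cov and the data are our own, and the division by n - 1 is an assumption chosen to match the default of ‘np.cov()’.

```python
import numpy as np

def sample_cov(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

x = [2, 4, 6, 8, 10]
y = [1, 3, 2, 5, 4]

manual = sample_cov(x, y)
library = np.cov(x, y)[0, 1]   # off-diagonal entry of the 2x2 matrix

print("manual: ", manual)
print("np.cov: ", library)
```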

Covariance Matrix

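A covariance matrix arranges the covariances of every pair of variables in a square table when working with multiple variables. The diagonal holds each variable's variance, and the matrix is symmetric.

A general form of the covariance matrix for p variables $X_1, \dots, X_p$ is shown below:

$$\Sigma = \begin{bmatrix} \operatorname{var}(X_1) & \operatorname{cov}(X_1, X_2) & \cdots & \operatorname{cov}(X_1, X_p) \\ \operatorname{cov}(X_2, X_1) & \operatorname{var}(X_2) & \cdots & \operatorname{cov}(X_2, X_p) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{cov}(X_p, X_1) & \operatorname{cov}(X_p, X_2) & \cdots & \operatorname{var}(X_p) \end{bmatrix}$$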
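A small sketch of such a matrix in NumPy, using three made-up variables stored one per row (the layout ‘np.cov()’ expects by default). It also checks two properties of any covariance matrix: it is symmetric, and its diagonal holds the variances.

```python
import numpy as np

# Three hypothetical variables, one per row, five observations each.
observations = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 1.0, 4.0, 3.0, 6.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
])

cov_matrix = np.cov(observations)

# Sample variance of each row (ddof=1 matches np.cov's default n - 1).
variances = observations.var(axis=1, ddof=1)

print(cov_matrix)
print("symmetric:", np.allclose(cov_matrix, cov_matrix.T))
print("diagonal equals variances:", np.allclose(np.diag(cov_matrix), variances))
```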

Covariance Vs. Auto-Covariance

Auto-covariance compares a single time series with its own previous and upcoming data values to understand temporal dependency, whereas covariance compares two different variables to understand the effect of one variable on the other.

Auto-Covariance Concept

Idea Behind Auto-Covariance

The concept of auto-covariance emerged from the desire to understand the relationship between the previous and upcoming values in a time series.

A time series is a sequence of observations recorded over time that reflects certain characteristics. These characteristics could be temporal dependencies, trends, and periodic patterns.

Auto-covariance helps us to calculate and understand these characteristics of the time series data.

The core idea behind auto-covariance is to measure how values at different time points in a time series relate to each other.

It is a vital tool for handling time series data, and it has a wide variety of applications.

Mathematical Representation

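Auto-covariance at a given lag k in a time series can be written as follows (the division by n is assumed here; some texts divide by n - k instead):

$$\gamma(k) = \frac{1}{n} \sum_{t=1}^{n-k} (X_t - \bar{x})(X_{t+k} - \bar{x})$$

where

$X$ = time series

$n$ = number of observations in the time series

$k$ = lag (the number of time steps between the two observations being compared)

$X_t$ = value at time t

$\bar{x}$ = mean of the time series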

Auto-covariance is used to understand how the relationship between the time series and its lagged versions changes as the lag increases.
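The definition translates almost directly into code. The sketch below is one possible implementation, not a NumPy built-in: the function name autocovariance is our own, and it assumes the common convention of dividing by n (some texts divide by n - k instead).

```python
import numpy as np

def autocovariance(x, k):
    """Sample auto-covariance of x at lag k, dividing by n (an assumed convention)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(x)")
    dev = x - x.mean()
    # Pair each deviation with the deviation k steps later and average.
    return np.sum(dev[: n - k] * dev[k:]) / n

series = [1, 2, 3, 4, 5]
values = [autocovariance(series, k) for k in range(len(series))]

for k, value in enumerate(values):
    print(f"lag {k}: {value:.2f}")
```

At lag 0 the result is simply the (biased) variance of the series, which gives a handy sanity check.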

Calculating Auto-Covariance using NumPy

‘numpy.cov()’

To use the numpy.cov(), we will need the lagged version of the time series data and then combine both the time series data and the lagged data to create a 2D array.

After this pre-processing, we can pass this data to the numpy.cov() function.

To begin, we need to install NumPy on our system. To do so, we will use pip, the default package manager for Python. To know more about pip, read the linked article.

To install NumPy, run the command below:

pip install numpy

Single Time Series

We will understand how to use the numpy.cov() function for a single time series with the help of a simple example as below.

I will be using Google Colab as my environment for this article.

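import numpy as np

# Time series and a copy shifted by one step.
# Note: np.roll wraps around, so the last value (1) moves to the front.
data = np.array([5, 4, 3, 2, 1])
lagged_data = np.roll(data, 1)

# Stack the two series as rows of a 2D array (one variable per row).
pre_processed_data = np.vstack((data, lagged_data))

# 2x2 covariance matrix; the off-diagonal entries relate data to lagged_data.
cov_mat = np.cov(pre_processed_data)

print("Covariance matrix for auto-covariance (single time series): ")
print(cov_mat)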

In the above code, we have started off by first importing the NumPy library as np.

Then we created a simple array of five elements using the ‘np.array()’ function. We named it ‘data’ and treat it as our time series.

After that, we created another array with the same elements as the ‘data’ array. The only difference is that we shifted all the elements by one place using the ‘np.roll()’ function, which moves the last element around to the front. We use this second array as the lagged version of the time series; hence we named it ‘lagged_data’.

Now we have the two required arrays: the time series, i.e., ‘data’, and its lagged version, i.e., ‘lagged_data’. Next, we combine the two arrays into a 2D array using the ‘np.vstack()’ function, because ‘np.cov()’ expects each variable as a row of a 2D array.

Lastly, we passed the 2D array that we just created to the ‘np.cov()’ function to generate the covariance matrix.

By stacking the two arrays, we were able to generate a covariance matrix whose off-diagonal entries hold the covariance between the series and its lagged copy.
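One caveat is worth illustrating: ‘np.roll()’ wraps around, so the first element of the lagged array is actually the last observation of the series. A possible alternative, sketched below under the assumption that we only want pairs that are truly k steps apart, is to build the lagged pair by slicing. Keep in mind that ‘np.cov()’ then uses a separate mean for each slice and divides by the number of pairs minus one, so the value differs slightly from the textbook auto-covariance formula.

```python
import numpy as np

data = np.array([5, 4, 3, 2, 1])
k = 1  # lag

# np.roll wraps the last value around to the front ...
rolled = np.roll(data, k)      # [1, 5, 4, 3, 2]

# ... whereas slicing keeps only pairs that are really k steps apart.
current = data[k:]             # [4, 3, 2, 1]
previous = data[:-k]           # [5, 4, 3, 2]

cov_rolled = np.cov(np.vstack((data, rolled)))[0, 1]
cov_sliced = np.cov(np.vstack((current, previous)))[0, 1]

print("with wrap-around:", cov_rolled)
print("with slicing:    ", cov_sliced)
```

On this tiny series the wrapped pair pulls the covariance down to 0, while the sliced pairs give about 1.67, so the choice matters most for short series.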

Multiple Time Series

We will understand how to use the numpy.cov() function for multiple time series with the help of a simple example, shown below.

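import numpy as np

# Two time series to analyze side by side.
data1 = np.array([1, 2, 3, 4, 5])
data2 = np.array([2, 3, 4, 5, 6])

# Copies shifted by one step (np.roll wraps the last value to the front).
lagged_data1 = np.roll(data1, 1)
lagged_data2 = np.roll(data2, 1)

# One variable per row: series 1, its lag, series 2, its lag.
pre_processed_data = np.vstack((data1, lagged_data1, data2, lagged_data2))

# 4x4 covariance matrix over all four rows.
cov_mat = np.cov(pre_processed_data)

print("Covariance matrix for auto-covariance (multiple time series):")
print(cov_mat)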

In the above code, we have started off by first importing the NumPy library as np.

Then we created two simple arrays of five elements using the ‘np.array()’ function. We named them ‘data1’ and ‘data2’ and treat them as our time series data.

After that, we created two more arrays with the same elements as the ‘data1’ and ‘data2’ arrays. The only difference is that we shifted all the elements by one place using the ‘np.roll()’ function. We use these new arrays as the lagged versions of the time series; hence we named them ‘lagged_data1’ and ‘lagged_data2’.

Now we have the required arrays: the time series and their lagged versions. Next, we combine these arrays into a single 2D array using the ‘np.vstack()’ function, because ‘np.cov()’ expects each variable as a row of a 2D array.

Lastly, we have passed the multi-dimensional array that we just created to the ‘np.cov()’ function, to generate the covariance matrix of auto-covariance.

You will have a similar output when you successfully execute the above code.
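The resulting 4x4 matrix mixes several kinds of entries, so it helps to know where to look. The sketch below rebuilds the same stacked array and picks out individual entries; the variable names are our own, and the row order is an assumption that follows the stacking order used above.

```python
import numpy as np

data1 = np.array([1, 2, 3, 4, 5])
data2 = np.array([2, 3, 4, 5, 6])

stacked = np.vstack((data1, np.roll(data1, 1), data2, np.roll(data2, 1)))
cov_mat = np.cov(stacked)

# Row/column order follows the stacking order:
# 0 = data1, 1 = lagged data1, 2 = data2, 3 = lagged data2
lag1_series1 = cov_mat[0, 1]   # data1 vs. its own lagged copy
lag1_series2 = cov_mat[2, 3]   # data2 vs. its own lagged copy
cross_cov = cov_mat[0, 2]      # data1 vs. data2 (ordinary covariance)

print("lag-1 entry, series 1:", lag1_series1)
print("lag-1 entry, series 2:", lag1_series2)
print("covariance between the two series:", cross_cov)
```

Because data2 is just data1 shifted up by one, both series produce the same lag-1 entry here.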

‘numpy.correlate()’

What is cross-correlation?

Cross-correlation is a mathematical tool used to quantify the likeness of two sequences. In the context of time series, it is used to find the extent to which the time series and its lagged version are similar.

It is usually used in problems where we need to find time shifts and identify patterns and periodicities between a pair of sequences.
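As an illustration of finding a time shift, the sketch below uses ‘np.correlate()’ on a made-up pulse and a copy of it delayed by two steps. The lag axis is built by hand, on the understanding that in ‘full’ mode the first output element corresponds to the most negative lag.

```python
import numpy as np

# A short pulse, and the same pulse arriving 2 steps later.
signal = np.array([0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0])
delayed = np.roll(signal, 2)   # trailing zeros wrap around, so this is a clean shift

# Cross-correlate at every possible shift.
xcorr = np.correlate(delayed, signal, mode="full")

# In 'full' mode the output runs from lag -(len(signal) - 1) to len(delayed) - 1.
lags = np.arange(-(len(signal) - 1), len(delayed))

# The lag with the largest cross-correlation is the estimated shift.
best_lag = lags[np.argmax(xcorr)]
print("estimated shift:", best_lag)
```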

Computing Auto-Covariance using Cross-Correlation

We will understand the process with the help of an example:

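import numpy as np

data = np.array([1, 2, 3, 4, 5])
n = len(data)

# Center the series, then correlate it with itself at every possible shift.
centered = data - data.mean()
auto_cov = np.correlate(centered, centered, mode='full')

# 'full' output covers lags -(n-1)..(n-1); keep lags 0..n-1 and divide
# by n to turn the raw sums into (biased) auto-covariance estimates.
positive_lags = np.arange(n)
auto_cov = auto_cov[n - 1:] / n

print("Auto-covariance values at different lags:")
for lag, value in zip(positive_lags, auto_cov):
    print(lag, value)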

The above code follows the same idea as the previous examples that used ‘np.cov()’. To take a deeper dive into how the np.correlate() function works, you can read through the linked article.

In the above example, we have treated ‘data’ as the time series and calculated auto-covariance values at different time lags using cross-correlation.
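A quick way to gain confidence in the cross-correlation route is to compare it with a direct, lag-by-lag sum. The sketch below assumes the convention of dividing by n for both routes; the variable names are our own.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5], dtype=float)
n = len(data)
dev = data - data.mean()

# Cross-correlation route: keep lags 0..n-1 and divide by n.
via_correlate = np.correlate(dev, dev, mode="full")[n - 1:] / n

# Direct route: sum of lagged products for each lag k, divided by n.
direct = np.array([np.sum(dev[: n - k] * dev[k:]) / n for k in range(n)])

print("via np.correlate:", via_correlate)
print("direct sum:      ", direct)
print("match:", np.allclose(via_correlate, direct))
```

For long series, the single ‘np.correlate()’ call is more convenient than looping over every lag.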

Visualizing Auto-Covariance

Now that we have covered the theory of auto-covariance and seen how it is calculated using NumPy, it helps to look at the results visually, since plots often make patterns easier to understand.

We shall plot some graphs with the help of another Python library called Matplotlib, which is used to visualize data.

To know more about Matplotlib, see the linked article. To install it, run the command below:

pip install matplotlib

Time Series Plot

We will start with a time series plot, which you can create with the help of the code below:

import numpy as np
import matplotlib.pyplot as plt

time_series = np.array([1, 2, 3, 4, 5])

time = np.arange(len(time_series))

plt.plot(time, time_series, marker='o', linestyle='-')
plt.xlabel('Time')
plt.ylabel('Value')
plt.title('Time Series Plot')
plt.grid(True)
plt.show()

The above code performs the task of plotting a graph of a time series using matplotlib.

We simply import the libraries as shown above.

We have created an array that we will treat as the time series. We also need an array of integers for the x-axis, which represents time, because we are plotting how the value changes as time increases.

Then we created a plot using ‘plt.plot()’ and passed its parameters as per our requirement. After that, we set certain attributes of the plot: we labeled the x and y axes as ‘Time’ and ‘Value’, set the title to ‘Time Series Plot’, and lastly used the ‘plt.show()’ command to display the plot.

You can take a look at the time series plot in the example shown in the image below:

Auto-Covariance with Lag

Moving on, we will also take a look at plotting auto-covariance against lag with the help of the example shown below:

import numpy as np
import matplotlib.pyplot as plt

auto_covariance = np.array([2.5, 1.2, 0.8, 0.4, -0.2])

lags = np.arange(1, len(auto_covariance) + 1)

# One vertical stem per lag (the use_line_collection argument was removed in newer Matplotlib).
plt.stem(lags, auto_covariance)
plt.xlabel('Lag')
plt.ylabel('Auto-covariance')
plt.title('Auto-covariance Plot with Lag')
plt.grid(True)
plt.show()

Plotting auto-covariance against lag is similar to plotting the time series. The differences are that we use ‘plt.stem()’ instead of ‘plt.plot()’ to draw one vertical line per lag, and that the axes show Lag and Auto-covariance instead of Time and Value. The rest of the procedure remains the same as for the time series plot.

You can see an example of the auto-covariance versus lag plot in the image below:

By plotting your results with the help of Matplotlib, you can better understand and explain the results of the calculations and operations that you have performed.

Plots are pleasant to look at and easy to understand.
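The plotting example above uses hand-typed auto-covariance values. As a rough sketch of how the two halves of this article fit together, the code below computes the values from a series first and then plots them. It assumes division by n, starts the lags at 0, and saves the figure to a file whose name is our own choice; the ‘Agg’ backend line is only there so the script runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line in a notebook
import matplotlib.pyplot as plt

time_series = np.array([1, 2, 3, 4, 5], dtype=float)
n = len(time_series)

# Auto-covariance at lags 0..n-1 via cross-correlation, divided by n.
dev = time_series - time_series.mean()
auto_covariance = np.correlate(dev, dev, mode="full")[n - 1:] / n
lags = np.arange(n)

fig, ax = plt.subplots()
ax.stem(lags, auto_covariance)
ax.set_xlabel("Lag")
ax.set_ylabel("Auto-covariance")
ax.set_title("Auto-covariance Plot with Lag")
ax.grid(True)
fig.savefig("auto_covariance.png")  # hypothetical output file name
```

In a notebook such as Google Colab, calling ‘plt.show()’ instead of saving displays the figure inline.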

Real-World Applications

Auto-covariance has many real-world applications, ranging from stock market analysis to climate forecasting. Investors make use of the patterns revealed by auto-covariance when investing in the stock market.

These patterns help them make calculated decisions about trading strategies, portfolio improvement, and other factors.

In climate and weather analysis, auto-covariance can reveal recurring patterns in the data, such as seasonal cycles.

Across all these use cases, auto-covariance supports decision-making by exposing the patterns and trends hidden in the data, which improves the quality of the decisions we make.

Summary

To wrap up, let’s review what we have covered in this article. We started with the basic concepts: the definition of auto-covariance and its importance in the field of statistics.

We then looked at what covariance is and went over the mathematics behind it.

Next, we examined auto-covariance in depth, covered its mathematical representation, and implemented it as code using NumPy and Python.

We saw a few examples of how we could visualize our results with the help of Matplotlib.

Lastly, we touched on how auto-covariance is used in the real world and what its potential use cases are.
