Understanding Bootstrap Statistics


Suppose you are a huge cricket fan and your favorite batsman is in poor form and has not been scoring many runs. Your friend bets you 1000 rupees that your favorite batsman is not going to score a half-century. Should you take the bet or not? Bootstrap statistics can help you make that decision.

Bootstrap statistics builds on the same ideas as conventional statistics. The difference is that it relies on sampling with replacement: the observed data is resampled many times, and the same data point can be selected more than once within a resample.

In this article, we will learn what bootstrap statistics is and walk through Python code that demonstrates it.


What is Bootstrap Statistics?

Bootstrap statistics is a powerful tool for a statistician. It estimates the sampling distribution of a statistic, but with a twist: instead of drawing fresh samples from the population, we draw samples from the data we already have, with replacement. This means that after a data point is selected, it is put back into the pool, so it can be selected again in the same resample. We compute the statistic on each resample, and the collection of these values approximates its distribution. Many iterations, typically in the thousands, are performed, and the results are used for tasks such as building confidence intervals and hypothesis testing.
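To make the idea of sampling with replacement concrete, here is a minimal sketch. The five values and the seed are made up purely for illustration; `rng.choice` is NumPy's standard random-choice method.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed so the run is repeatable

# A tiny made-up sample, only for illustration
sample = np.array([10, 20, 30, 40, 50])

# Without replacement: a reshuffle, every value appears exactly once
without_replacement = rng.choice(sample, size=len(sample), replace=False)

# With replacement: some values can repeat and others can be left out
with_replacement = rng.choice(sample, size=len(sample), replace=True)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
```

A bootstrap sample is the second kind: it has the same size as the original data, but it is usually not an exact copy of it.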

Bootstrap statistics has some advantages. It is useful when the dataset is small or when no simple formula exists for the variability of a statistic, and it can be more accurate than traditional methods when their assumptions do not hold. It is also versatile and can be applied to many statistics, such as means, medians, and correlations.
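As a rough illustration of that versatility, the sketch below estimates the standard error of a median, a statistic with no simple textbook formula for its variability. The data is simulated, and the helper name `bootstrap_standard_error` is our own, not a library function.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed for repeatability

# Simulated skewed data (an assumption for illustration, not real measurements)
data = rng.exponential(scale=2.0, size=30)

def bootstrap_standard_error(values, statistic, n_boot=2000):
    """Estimate the standard error of `statistic` by resampling `values` with replacement."""
    replicates = np.empty(n_boot)
    for i in range(n_boot):
        resample = rng.choice(values, size=len(values), replace=True)
        replicates[i] = statistic(resample)
    # The spread of the replicates is the bootstrap standard error
    return replicates.std(ddof=1)

print("Sample median:", np.median(data))
print("Bootstrap SE of the median:", bootstrap_standard_error(data, np.median))
print("Bootstrap SE of the mean:  ", bootstrap_standard_error(data, np.mean))
```

The same helper works for any function that maps an array to a single number, which is why only the `statistic` argument changes between the last two calls.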

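It has some disadvantages as well. Bootstrap statistics does not overcome underlying biases in the dataset: if the sample is unrepresentative, every resample inherits the problem. It can also be time-consuming on large or complex datasets, because the statistic has to be recomputed for every resample.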

The major difference between conventional statistics and bootstrap statistics is that the bootstrap makes fewer assumptions. It does not require the data to follow a particular distribution such as the normal, although it still assumes that the sample is representative of the population. Bootstrap statistics helps you quantify the variability of a given statistic.

Let us move on to a simple Python example of bootstrap statistics.

Python Implementation of Bootstrap Statistics

The script below bootstraps the mean of a sample of 100 values.

import numpy as np

# Generate some sample data
data = np.random.randn(100)  # Sample 100 random numbers from a standard normal distribution

# Define the statistic we want to bootstrap (mean)
def statistic(data):
    return np.mean(data)

# Set the number of bootstrap replicates
n_boot = 1000

# Perform bootstrapping
bootstrapped_stats = []
for _ in range(n_boot):
    # Resample the data with replacement (bootstrap sample)
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    # Calculate the statistic on the bootstrap sample
    bootstrapped_stat = statistic(bootstrap_sample)
    # Append the statistic to the list
    bootstrapped_stats.append(bootstrapped_stat)

# Calculate the original statistic
original_stat = statistic(data)

# Analyze the bootstrap distribution
boot_mean = np.mean(bootstrapped_stats)
boot_std = np.std(bootstrapped_stats)

# Calculate a 95% confidence interval using the normal approximation (mean +/- 1.96 standard deviations)
lower_ci = boot_mean - 1.96 * boot_std
upper_ci = boot_mean + 1.96 * boot_std

# Print the results
print("Original statistic:", original_stat)
print("Bootstrapped mean:", boot_mean)
print("Bootstrapped standard deviation:", boot_std)
print("95% confidence interval:", lower_ci, "-", upper_ci)

# Visualize the bootstrap distribution (optional)
import matplotlib.pyplot as plt
plt.hist(bootstrapped_stats, bins=20, density=True, label="Bootstrap distribution")
plt.axvline(original_stat, color='r', linestyle='--', label="Original statistic")
plt.axvspan(lower_ci, upper_ci, alpha=0.2, color='b', label="95% confidence interval")
plt.xlabel("Statistic value")
plt.ylabel("Density")
plt.legend()
plt.show()

In the above code, we generate 100 random numbers from a standard normal distribution and then resample them with replacement 1000 times, computing the mean of each resample. We get the following output.

Bootstrap Statistics Output

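We can observe that the original statistic, the mean of the 100 data points, is approximately 0.084. The bootstrap mean, which is the average of the means computed on the bootstrap samples, is around 0.073. The shaded band in the plot marks the 95% confidence interval. Because the code does not set a random seed, your numbers will differ from run to run.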
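The script above builds its interval from the normal approximation, the mean plus or minus 1.96 standard deviations. A common alternative is the percentile interval, which reads the bounds directly off the bootstrap distribution. The sketch below shows both and also replaces the Python loop with a single vectorized resampling step. The seed and the variable names are our own choices, so the numbers will not match the output above.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed for repeatability
data = rng.standard_normal(100)       # same kind of sample as in the script above
n_boot = 1000

# Draw all resamples at once: each row holds the indices of one bootstrap sample
indices = rng.integers(0, len(data), size=(n_boot, len(data)))
boot_means = data[indices].mean(axis=1)

# Percentile interval: the middle 95% of the bootstrap means
lower_pct, upper_pct = np.percentile(boot_means, [2.5, 97.5])

# Normal-approximation interval, for comparison
se = boot_means.std(ddof=1)
lower_norm = boot_means.mean() - 1.96 * se
upper_norm = boot_means.mean() + 1.96 * se

print("Percentile 95% CI:          ", lower_pct, "-", upper_pct)
print("Normal-approximation 95% CI:", lower_norm, "-", upper_norm)
```

For a roughly symmetric statistic such as the mean of normal data, the two intervals should be close. They tend to differ more when the bootstrap distribution is skewed. The vectorized version trades memory for speed, since it holds all `n_boot` resamples at once.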

Conclusion

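Here you go! Now you can decide whether or not to go ahead with the 1000 rupee bet with your friend. As mentioned earlier, bootstrap statistics is a powerful statistical tool, but pay attention to the size of the dataset. With large samples, conventional formulas usually give similar answers at a fraction of the computational cost, so the bootstrap is most useful when the sample is small or when no simple formula exists for the statistic. Keep in mind that a very small sample may not represent the population well, and resampling cannot fix that.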
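As a rough sketch of how the bet from the introduction could be approached, the code below bootstraps the batsman's half-century rate. The scores are entirely hypothetical, and the approach assumes each innings is an independent draw from the same form, which a run of bad form may well violate.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed for repeatability

# HYPOTHETICAL scores from the batsman's last 12 innings (made up for this sketch)
recent_scores = np.array([12, 4, 35, 51, 0, 18, 27, 63, 9, 41, 2, 22])

def half_century_rate(scores):
    """Fraction of innings in which the batsman reached 50 or more."""
    return np.mean(scores >= 50)

n_boot = 5000
rates = np.empty(n_boot)
for i in range(n_boot):
    # Resample the innings with replacement and recompute the rate
    resample = rng.choice(recent_scores, size=len(recent_scores), replace=True)
    rates[i] = half_century_rate(resample)

observed = half_century_rate(recent_scores)
lower, upper = np.percentile(rates, [2.5, 97.5])

print("Observed half-century rate:", observed)
print("Bootstrap 95% interval:", lower, "-", upper)
```

A wide interval is itself useful information: it says that a dozen innings are not enough to pin down the batsman's chances, which is a reason to be cautious about the bet.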

Hope you enjoyed reading it!!
