Introduction to Bootstrap Sampling in Python

In statistics, bootstrap sampling is a method that repeatedly draws samples, with replacement, from a data set in order to estimate a population parameter.

Sampling is the process of selecting a subset from a large collection of data in order to estimate a characteristic of the entire data set. Sampling with replacement means that a data point that has already been selected is returned to the pool and can be selected again. Finally, parameter estimation is the process of estimating the parameters of the entire population on the basis of samples.
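To make the difference concrete, here is a minimal sketch comparing a draw without replacement to a draw with replacement. The data (the integers 1 to 10) and the seed are made up for illustration, and the sketch uses NumPy's Generator API (np.random.default_rng) so that the result is reproducible.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
population = np.arange(1, 11)    # hypothetical data: the integers 1..10

# Without replacement: every value appears at most once in the sample
without_replacement = rng.choice(population, size=10, replace=False)

# With replacement: the same value may be drawn several times
with_replacement = rng.choice(population, size=10, replace=True)

print("Without replacement:", np.sort(without_replacement))
print("With replacement:   ", np.sort(with_replacement))
```

The first line of output is always a permutation of the original values, while the second will typically contain repeats and omit some values.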

To understand the need for bootstrap sampling, consider an example. Suppose we want to calculate the average age of 1000 employees working for a particular company. The first approach would be to ask all 1000 employees their age and then calculate the mean.

This is tedious and time-consuming. Another method is to take a sample of 5 employees and collect their ages, repeat this process 20 times, and then average the 100 collected ages. This average serves as an estimate of the mean age of all 1000 employees.
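The scheme described above can be sketched as a small simulation. The employee ages here are randomly generated and purely hypothetical. The sketch also assumes that each group of 5 is drawn independently, so the same employee can appear in more than one group.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical population: ages of 1000 employees, drawn uniformly from 22..64
employee_ages = rng.integers(22, 65, size=1000)

# Draw 20 independent groups of 5 employees and record each group's mean age
sample_means = [
    rng.choice(employee_ages, size=5, replace=False).mean()
    for _ in range(20)
]

# Averaging the 20 group means is the same as averaging all 100 collected ages
estimate = np.mean(sample_means)

print("True mean age:     ", employee_ages.mean())
print("Estimated mean age:", estimate)
```

In a simulation like this the true mean is known, so the estimate from 100 collected ages can be compared against it directly.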

Implementation of Bootstrap Sampling in Python

Bootstrap sampling is a statistical method used to analyze data by repeatedly drawing subsets from a larger dataset and estimating population parameters. In Python, you can use the NumPy library to implement bootstrap sampling. Use np.random.choice() to generate bootstrap samples with replacement, then calculate the mean, standard deviation, or confidence intervals as required.

Example 1: Basic Bootstrap Sampling

import numpy as np

# Original sample of ages
ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]

# Number of bootstrap samples to generate
num_samples = 1000

# Array that will hold the mean of each bootstrap sample
bootstrap_means = np.zeros(num_samples)

# Perform bootstrap sampling
for i in range(num_samples):
    # Resample with replacement, keeping the original sample size
    bootstrap_sample = np.random.choice(ages, size=len(ages), replace=True)
    bootstrap_mean = np.mean(bootstrap_sample)
    bootstrap_means[i] = bootstrap_mean

# Mean and standard deviation of the bootstrap means
estimated_mean = np.mean(bootstrap_means)
estimated_std = np.std(bootstrap_means, ddof=1)

print("Estimated population mean age:", estimated_mean)
print("Standard error of the estimate:", estimated_std)

We import the NumPy library under its alias np and define a sample of ages in the variable ages. We set num_samples = 1000 as the number of bootstrap samples to generate, then initialize the array bootstrap_means to store the mean of each one. Inside the loop, np.random.choice(ages, size=len(ages), replace=True) resamples with replacement from the original sample to generate a bootstrap sample.

np.mean(bootstrap_sample) calculates the mean age of the bootstrap sample, and bootstrap_means[i] = bootstrap_mean stores it. Finally, np.mean(bootstrap_means) gives the estimated population mean, and np.std(bootstrap_means, ddof=1), the standard deviation of the bootstrap means, gives the standard error of that estimate.

The output will display the estimated population mean age and the standard error of the estimate.
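Because the numbers are drawn at random, each run prints slightly different values. As a hedged alternative, the sketch below seeds a NumPy Generator (the seed 123 is arbitrary) and draws all bootstrap samples in a single call instead of a loop. It is one possible way to write the same computation, not a required one.

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed, only for reproducibility
ages = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
num_samples = 1000

# Draw all bootstrap samples at once: one row per resample
bootstrap_samples = rng.choice(ages, size=(num_samples, len(ages)), replace=True)

# The mean of each row gives one bootstrap mean per resample
bootstrap_means = bootstrap_samples.mean(axis=1)

print("Estimated population mean age:", bootstrap_means.mean())
print("Standard error of the estimate:", bootstrap_means.std(ddof=1))
```

Passing a tuple as size produces a 2-D array, so the per-sample means come from a single reduction along axis 1.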


Example 2: Bootstrap Sampling for Confidence Intervals

#Bootstrap sampling for confidence intervals:
import numpy as np

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
num_samples = 1000

bootstrap_means = np.zeros(num_samples)

# Perform bootstrap sampling
for i in range(num_samples):
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_mean = np.mean(bootstrap_sample)
    bootstrap_means[i] = bootstrap_mean

confidence_interval = np.percentile(bootstrap_means, [2.5, 97.5])

print("95% Confidence interval:", confidence_interval)

The above code uses bootstrap sampling to estimate a 95% confidence interval for the population mean. We define the original sample data and set the number of bootstrap samples to generate, num_samples. The array bootstrap_means is initialized to store the mean of each bootstrap sample. In each iteration, bootstrap_sample holds a resample drawn with replacement from the original sample, and bootstrap_mean holds its mean. Finally, we calculate a 95% confidence interval by taking the 2.5th and 97.5th percentiles of the bootstrap means.
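The same percentile approach works for statistics other than the mean. The sketch below wraps it in a helper; the name bootstrap_ci and its parameters are hypothetical, introduced only for illustration, and the seed parameter is there just to make results repeatable.

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, num_samples=1000, confidence=0.95, seed=None):
    """Percentile bootstrap confidence interval for `statistic` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # One row per bootstrap resample, each the same size as the original sample
    resamples = rng.choice(data, size=(num_samples, len(data)), replace=True)
    # Apply the statistic to every row
    estimates = np.apply_along_axis(statistic, 1, resamples)
    # Split the leftover probability equally between the two tails
    alpha = (1.0 - confidence) * 100
    lower, upper = np.percentile(estimates, [alpha / 2, 100 - alpha / 2])
    return lower, upper

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print("95% CI for the mean:  ", bootstrap_ci(data, np.mean, seed=1))
print("90% CI for the median:", bootstrap_ci(data, np.median, confidence=0.90, seed=1))
```

With confidence=0.95 the helper takes the 2.5th and 97.5th percentiles, matching the example above; other confidence levels only change which percentiles are read off.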


Example 3: Two-Sample Bootstrap Hypothesis Test

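import numpy as np

group1 = np.array([10, 12, 15, 18, 20])
group2 = np.array([8, 11, 13, 16, 19])

num_samples = 1000

# Observed difference in means between the two original groups
observed_diff = np.mean(group1) - np.mean(group2)

# Shift both groups to the pooled mean so that the null hypothesis
# (equal means) holds in the data we resample from
pooled_mean = np.mean(np.concatenate([group1, group2]))
group1_null = group1 - np.mean(group1) + pooled_mean
group2_null = group2 - np.mean(group2) + pooled_mean

bootstrap_diffs = np.zeros(num_samples)

# Perform bootstrap sampling under the null hypothesis
for i in range(num_samples):
    bootstrap_group1 = np.random.choice(group1_null, size=len(group1_null), replace=True)
    bootstrap_group2 = np.random.choice(group2_null, size=len(group2_null), replace=True)

    bootstrap_diff = np.mean(bootstrap_group1) - np.mean(bootstrap_group2)
    bootstrap_diffs[i] = bootstrap_diff

# One-sided p-value: share of null resamples at least as large as the observed difference
p_value = np.mean(bootstrap_diffs >= observed_diff)

print("Bootstrap p-value:", p_value)

In the third example, we perform a two-sample bootstrap hypothesis test to check whether the mean of group1 is significantly larger than the mean of group2. We define the two groups and set the number of bootstrap samples to 1000 in num_samples. observed_diff stores the difference in means of the original samples. Because a p-value must be computed under the null hypothesis of equal means, we first shift both groups so that they share the pooled mean (group1_null and group2_null). Resampling the unshifted groups would produce differences centered on the observed difference, so the resulting proportion would hover around 0.5 regardless of the data. The array bootstrap_diffs, initialized with np.zeros(num_samples), stores the difference in means of each bootstrap sample, and bootstrap_group1 and bootstrap_group2 hold the resamples drawn with replacement from the shifted groups. Finally, p_value is the proportion of bootstrap samples whose difference in means is greater than or equal to the observed difference, which makes this a one-sided test.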
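A complementary way to look at the same two groups is a percentile confidence interval for the difference in means. The sketch below is an illustration with an arbitrary seed, not a replacement for the test above: it resamples each group independently and checks whether the resulting 95% interval contains 0.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed, only for reproducibility
group1 = np.array([10, 12, 15, 18, 20])
group2 = np.array([8, 11, 13, 16, 19])
num_samples = 1000

# Resample each group independently, one row per bootstrap replicate
boot1 = rng.choice(group1, size=(num_samples, len(group1)), replace=True)
boot2 = rng.choice(group2, size=(num_samples, len(group2)), replace=True)

# Difference in means for every replicate
bootstrap_diffs = boot1.mean(axis=1) - boot2.mean(axis=1)

lower, upper = np.percentile(bootstrap_diffs, [2.5, 97.5])

print("Observed difference in means:", group1.mean() - group2.mean())
print("95% CI for the difference:", (lower, upper))
print("Interval contains 0:", lower <= 0 <= upper)
```

If the interval contains 0, the data are compatible with no difference between the group means at roughly the 5% level. With only five observations per group, a wide interval is to be expected.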


Summary

Bootstrap sampling is a powerful technique for statistical analysis in Python. It lets you estimate population parameters, along with the uncertainty around those estimates, from a single sample instead of collecting data on the entire population. What are some other applications of bootstrap sampling in your field?
