In statistics, bootstrap sampling is a method that repeatedly draws samples with replacement from a data source in order to estimate a population parameter.
Sampling is the process of selecting a subset of a large collection of data in order to estimate a characteristic of the entire data set. Sampling with replacement means that a data point drawn for one sample is returned to the pool, so it can reappear in the same sample or in later ones. Finally, parameter estimation is the process of estimating the parameters of the entire population on the basis of samples.
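To make the idea of sampling with replacement concrete, here is a minimal sketch using NumPy. The data (the integers 1 to 10) and the seed are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only
population = np.arange(1, 11)        # hypothetical data: the integers 1..10

# Without replacement: every drawn value is distinct
without = rng.choice(population, size=10, replace=False)

# With replacement: the same value may be drawn more than once
with_repl = rng.choice(population, size=10, replace=True)

print("Without replacement:", np.sort(without))
print("With replacement:   ", np.sort(with_repl))
```

Drawing 10 values without replacement from 10 data points simply returns a shuffled copy of the data, while drawing with replacement usually repeats some values and omits others.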
To understand the need for bootstrap sampling, let’s consider an example. Suppose we want to calculate the average age of the 1000 employees working for a particular company. The first approach could be to ask all 1000 employees their age and then calculate the mean age.
This can be a tedious and time-consuming approach. Another method is to take a sample of 5 employees and collect their ages, repeat this process 20 times, and then average the collected ages of the resulting 100 employees. This average would be the estimate of the mean age of all 1000 employees.
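The employee example can be sketched in a few lines. The ages below are simulated, since the article gives no real data; the age range and the seed are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for repeatability only

# Hypothetical population: ages of 1000 employees, simulated for illustration
ages = rng.integers(low=22, high=65, size=1000)

# Draw 20 samples of 5 employees each, with replacement
samples = rng.choice(ages, size=(20, 5), replace=True)

# Mean age within each sample, then the average over all 20 samples
sample_means = samples.mean(axis=1)
estimate = sample_means.mean()

print("Estimated mean age:", estimate)
print("True mean age:     ", ages.mean())
```

Because every sample has the same size, averaging the 20 sample means gives the same number as averaging all 100 collected ages directly.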
Implementation of Bootstrap Sampling in Python
Bootstrap sampling is a statistical method used to analyze data by repeatedly drawing subsets from a larger dataset and estimating population parameters. In Python, you can use the NumPy library to implement bootstrap sampling. Use np.random.choice() to generate bootstrap samples with replacement, then calculate the mean, standard deviation, or confidence intervals as required.
Example 1: Basic Bootstrap Sampling
import numpy as np

ages = [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
num_samples = 1000
bootstrap_means = np.zeros(num_samples)

# Perform bootstrap sampling
for i in range(num_samples):
    bootstrap_sample = np.random.choice(ages, size=len(ages), replace=True)
    bootstrap_mean = np.mean(bootstrap_sample)
    bootstrap_means[i] = bootstrap_mean

estimated_mean = np.mean(bootstrap_means)
estimated_std = np.std(bootstrap_means, ddof=1)

print("Estimated population mean age:", estimated_mean)
print("Standard error of the estimate:", estimated_std)
We import the numpy library under its alias np. We define a sample of ages in the variable ages and set num_samples = 1000, the number of bootstrap samples to generate. Then we initialize the array bootstrap_means to store the mean of each bootstrap sample.
Inside the loop, np.random.choice(ages, size=len(ages), replace=True) resamples with replacement from the original sample to generate a bootstrap sample. np.mean(bootstrap_sample) calculates the mean age of that bootstrap sample, and the result is stored in bootstrap_means[i].
Finally, np.mean(bootstrap_means) gives the estimated population mean, and np.std(bootstrap_means, ddof=1) gives the standard error of that estimate.
The output displays the estimated population mean age and the standard error of the estimate. Because the resampling is random, the exact values change slightly from run to run.
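The same calculation can also be written without an explicit loop. The sketch below is one possible variant, not the only way to do it: it assumes NumPy's Generator API (np.random.default_rng) with an arbitrary seed so that the result is repeatable, and draws all bootstrap samples in a single call.

```python
import numpy as np

ages = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
num_samples = 1000

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only

# Draw all bootstrap samples at once: one row per bootstrap sample
resamples = rng.choice(ages, size=(num_samples, len(ages)), replace=True)

# Mean of each row gives one bootstrap mean per sample
bootstrap_means = resamples.mean(axis=1)

print("Estimated population mean age:", bootstrap_means.mean())
print("Standard error of the estimate:", bootstrap_means.std(ddof=1))
```

Fixing the seed is useful when you need the same numbers on every run, for example in a report or a test.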
Example 2: Bootstrap Sampling for Confidence Intervals
# Bootstrap sampling for confidence intervals
import numpy as np

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
num_samples = 1000
bootstrap_means = np.zeros(num_samples)

# Perform bootstrap sampling
for i in range(num_samples):
    bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
    bootstrap_mean = np.mean(bootstrap_sample)
    bootstrap_means[i] = bootstrap_mean

confidence_interval = np.percentile(bootstrap_means, [2.5, 97.5])
print("95% Confidence interval:", confidence_interval)
The above code performs bootstrap sampling to estimate a 95% confidence interval for the population mean from the original sample. We define an original sample, data, and set the number of bootstrap samples to generate, num_samples. The array bootstrap_means is initialized to store the mean of each bootstrap sample.
Inside the loop, bootstrap_sample is generated by resampling with replacement from the original sample, and bootstrap_mean holds the mean of that bootstrap sample. At last, we calculate a 95% confidence interval by taking the 2.5th and 97.5th percentiles of the bootstrap means.
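The percentile approach works for statistics other than the mean as well. The helper below is a hypothetical function, named bootstrap_ci for this illustration only; it is a minimal sketch that assumes the statistic can be computed by any function taking a one-dimensional array.

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, num_samples=1000, alpha=0.05, seed=None):
    """Percentile bootstrap confidence interval for stat(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = np.empty(num_samples)
    for i in range(num_samples):
        # Resample with replacement, keeping the original sample size
        resample = rng.choice(data, size=len(data), replace=True)
        stats[i] = stat(resample)
    # For alpha = 0.05 these are the 2.5th and 97.5th percentiles
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
lower, upper = bootstrap_ci(data, stat=np.median, alpha=0.10, seed=0)
print("90% confidence interval for the median:", (lower, upper))
```

Passing a different function as stat, or a different alpha, changes the statistic and the confidence level without touching the resampling loop.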
Example 3: Two-Sample Bootstrap Hypothesis Test
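import numpy as np

group1 = [10, 12, 15, 18, 20]
group2 = [8, 11, 13, 16, 19]
num_samples = 1000
bootstrap_diffs = np.zeros(num_samples)

# Observed difference in means between the two original groups
observed_diff = np.mean(group1) - np.mean(group2)

# Under the null hypothesis both groups come from the same distribution,
# so both bootstrap groups are drawn from the pooled data
pooled = np.concatenate([group1, group2])

# Perform bootstrap sampling
for i in range(num_samples):
    bootstrap_group1 = np.random.choice(pooled, size=len(group1), replace=True)
    bootstrap_group2 = np.random.choice(pooled, size=len(group2), replace=True)
    bootstrap_diff = np.mean(bootstrap_group1) - np.mean(bootstrap_group2)
    bootstrap_diffs[i] = bootstrap_diff

# One-sided p-value: share of null resamples at least as large as the observed difference
p_value = np.mean(bootstrap_diffs >= observed_diff)
print("Bootstrap p-value:", p_value)
In the third example, we perform a two-sample bootstrap hypothesis test to determine whether there is a significant difference between the means of two independent groups. We define two groups, group1 and group2, and set the number of bootstrap samples to generate to 1000 in the variable num_samples. np.zeros(num_samples) initializes an array to store the difference in means of each bootstrap sample, and observed_diff holds the difference in means of the original groups.
The p-value must be computed under the null hypothesis that the two groups do not differ. Resampling each group only from its own data would produce differences centered on the observed difference, so the proportion at or above it would be close to 0.5 regardless of the data. Instead, we combine both groups into pooled, and bootstrap_group1 and bootstrap_group2 are drawn with replacement from this pooled data. p_value is then the proportion of bootstrap samples with a difference in means greater than or equal to the difference in means of the original samples, which makes this a one-sided test.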
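For a test that targets only the equality of means, one option is to shift each group so that both share the same mean and then resample each group separately. The sketch below is a simplified illustration of that idea under stated assumptions: it uses the plain difference in means as the test statistic, a two-sided p-value, and an arbitrary seed. The names shifted1, shifted2, and null_diffs are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only

group1 = np.array([10, 12, 15, 18, 20])
group2 = np.array([8, 11, 13, 16, 19])
num_samples = 1000

observed_diff = group1.mean() - group2.mean()

# Shift each group so both have the pooled mean; the null hypothesis
# of equal means then holds by construction
pooled_mean = np.concatenate([group1, group2]).mean()
shifted1 = group1 - group1.mean() + pooled_mean
shifted2 = group2 - group2.mean() + pooled_mean

null_diffs = np.empty(num_samples)
for i in range(num_samples):
    s1 = rng.choice(shifted1, size=len(shifted1), replace=True)
    s2 = rng.choice(shifted2, size=len(shifted2), replace=True)
    null_diffs[i] = s1.mean() - s2.mean()

# Two-sided p-value: share of null differences at least as extreme as the observed one
p_value = np.mean(np.abs(null_diffs) >= abs(observed_diff))
print("Observed difference in means:", observed_diff)
print("Two-sided bootstrap p-value:", p_value)
```

With only five observations per group the resulting p-value is coarse, so it is best read as a rough indication rather than a precise figure.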
Bootstrap sampling is a powerful technique for statistical analysis in Python. It lets you estimate population parameters, together with their standard errors and confidence intervals, from a single sample instead of measuring the whole population. What are some other applications of bootstrap sampling in your field?