February 16, 2021

Bootstrapping and Central Limit Theorem

Today's topic is bootstrapping, a resampling approach that is frequently used in place of classical statistical inference. The method rests on the Central Limit Theorem.

The Central Limit Theorem states that if you take sufficiently many random samples with replacement from the data, the distribution of the samples' means will approximate a normal distribution, no matter what the original distribution looks like.

The quality of the approximation depends on two parameters: the sample size and the number of samples. If the distribution of your data is already close to normal, smaller values of both parameters will give a good approximation at a lower computational cost.
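As a quick sketch of this effect (the exponential distribution, sample size, and number of resamples below are illustrative choices, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=10_000)  # heavily right-skewed

# Means of many resamples drawn with replacement
sample_means = [rng.choice(data, size=200, replace=True).mean()
                for _ in range(5_000)]

def skewness(x):
    x = np.asarray(x)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

# The means are far less skewed than the raw data, as the CLT predicts
print(skewness(data))          # roughly 2 for an exponential distribution
print(skewness(sample_means))  # much closer to 0
```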

Bootstrapping works the same way. A code snippet implementing it is presented below.

import numpy as np

def bootstrap(X, loops, sample_size, func):
    # sample_size is a fraction of the data size (e.g. 1.0 = full size)
    res = []
    sample_size = int(X.shape[0] * sample_size)
    for i in range(loops):
        res.append(
            func(
                np.random.choice(
                    X,
                    size=sample_size,
                    replace=True
                )
            )
        )
    return res

The algorithm is simple. You draw a sample with replacement from the observations X and calculate the metric you are interested in. Then you record the result and repeat the loop N times.
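For instance, a hypothetical usage of the function above on skewed toy data (note that `sample_size` is a fraction of the data size, since the function multiplies it by `X.shape[0]`); the function is repeated here so the snippet runs on its own:

```python
import numpy as np

def bootstrap(X, loops, sample_size, func):  # same function as above
    res = []
    sample_size = int(X.shape[0] * sample_size)
    for i in range(loops):
        res.append(func(np.random.choice(X, size=sample_size, replace=True)))
    return res

X = np.random.exponential(scale=5.0, size=2_000)  # illustrative skewed data

# 1,000 bootstrapped means, each from a resample half the size of the data
means = bootstrap(X, loops=1_000, sample_size=0.5, func=np.mean)
print(np.mean(means))  # close to the mean of X, which is about 5
```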

Bootstrap Parameters

To achieve a good approximation within a reasonable time you have to decide on how many samples (or iterations) you need and what should be the size of each sample.

To pick the right number of samples you can use the following principle: the closer your distribution is to normal, the fewer samples you need. It's a trade-off between computational cost and the quality of the approximation.

As for the sample size, I found this issue heatedly debated. Some sources state that you should take smaller sizes; others argue that the bootstrap is a large-sample approach and that it is better to pick a sample size close to the size of the data it is drawn from.

Let's illustrate how it works. I have simulated a distribution that is obviously far from normal.

The simulated distribution of data that is far from normal

I fixed the number of iterations and changed only the sample size, recording the p-values of a normality test on the distribution of the means. The null hypothesis states that the distribution is normal, so p-values below the significance level (0.01 or 0.05) indicate non-normal distributions.
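A minimal sketch of this kind of check, assuming a lognormal distribution as the "far from normal" data and 0.05 as the significance level:

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(2)
data = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)  # clearly non-normal

# The raw data fails the normality test: its p-value is far below 0.05
print(normaltest(data).pvalue)

# Bootstrap the mean and test the distribution of the means instead
boot_means = [rng.choice(data, size=2_000, replace=True).mean()
              for _ in range(3_000)]
stat, pvalue = normaltest(boot_means)

# Null hypothesis: the means are normally distributed;
# a p-value BELOW the significance level rejects normality
print('normal' if pvalue >= 0.05 else 'non-normal')
```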

Changing approximation with increasing the sample size

You can see that smaller sample sizes break the normality assumption from time to time. However, once the sample size becomes larger than or equal to the size of the original data, the bootstrap converges and the distribution of sample means is consistently normal.

The convergence to normality depending on the sample size

However, this convergence comes at a higher computational cost: bigger sample sizes mean more time is needed to bootstrap the data.

Increasing time to compute the bootstrap with the sample size

Since the distribution seems to be stable starting from a sample size of 500, you can proceed with this number to reach your goal in a reasonable time.

This time let's try to keep the sample size fixed and only change the number of samples.

Changing approximation increasing the number of samples

The bootstrap seems to converge starting from 30,000 samples. It meets the basic normality assumption and approximates the original data well in terms of the means and standard deviations of the samples.

The convergence to normality depending on the number of samples
Increasing time to compute the bootstrap with the number of samples

To sum up, I want to note that picking the right parameters for the bootstrap is a trade-off between computational cost and the quality of the approximation. Having selected a set of parameters, analysts should run a series of additional experiments to make sure their conclusions don't depend on the sample size or the number of iterations.
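One way to sketch such a sensitivity experiment (the data and the candidate sample sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=3.0, size=5_000)  # illustrative skewed data

# Recompute the bootstrapped mean for several candidate sample sizes and
# check that the estimate stays stable while its spread shrinks
spreads = {}
for size in (100, 500, 1_000, 5_000):
    means = [rng.choice(data, size=size, replace=True).mean()
             for _ in range(2_000)]
    spreads[size] = np.std(means)
    print(size, round(np.mean(means), 3), round(spreads[size], 3))
```

If the printed means agree across sizes while only the spread changes, the conclusion does not hinge on the chosen parameters.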

Bootstrap vs. Statistical Inference

At the beginning of the post, I mentioned that the bootstrap is frequently used as an alternative to statistical inference. In fact, statistical inference is the preferred approach: for example, if we are confident that the data in both groups is normally distributed, we would rather use a t-test than resampling to show the difference in means.

There are a lot of statistical tests for different kinds of distributions and various metrics. However, sometimes the distribution of the data doesn't fit any well-studied law, or the target metric is not a typical one. In such cases, resampling techniques like the bootstrap may be the only solution.

Let's have a look at the following example. I've created two random variables that are normally distributed with a significant difference in means.

import numpy as np
X1 = np.random.normal(50, 15, 1000)
X2 = np.random.normal(20, 10, 1000)
An example of two normally distributed groups

Normality tests confirm that both X1 and X2 are normally distributed (the null hypothesis of normality is not rejected).

from scipy.stats import normaltest
print(normaltest(X1)) # NormaltestResult(statistic=1.0509, pvalue=0.591)
print(normaltest(X2)) # NormaltestResult(statistic=1.2180, pvalue=0.543)

That allows us to conduct a t-test.

from scipy.stats import ttest_ind

print(ttest_ind(X1, X2))
# Ttest_indResult(statistic=52.4071926223475, pvalue=0.0)

The t-test shows that there is a statistically significant difference between the means of the two groups.

Now let's look at the result of bootstrapping with 10,000 samples and sample sizes equal to the sizes of X1 and X2 respectively.
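A sketch of that comparison, regenerating the two groups so the snippet is self-contained:

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal(50, 15, 1000)
X2 = rng.normal(20, 10, 1000)

# 10,000 bootstrapped means per group, sample size equal to the data size
b1 = [rng.choice(X1, size=X1.shape[0], replace=True).mean() for _ in range(10_000)]
b2 = [rng.choice(X2, size=X2.shape[0], replace=True).mean() for _ in range(10_000)]

# The two distributions of means are far apart: every bootstrapped mean
# of X1 is larger than every bootstrapped mean of X2
print(min(b1) > max(b2))  # True for groups this well separated
```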

As you can notice, the result demonstrates a difference in the means of the two groups. Thus, we come to the same conclusion as with the t-test.

Case Study

Let's apply the bootstrap to a simulated example. We have two groups of people, and we have measured and recorded the height of every person.

The distributions of heights in groups A and B are presented below.

Histogram of height distribution for Group A
Histogram of height distribution for Group B

As you can see, the distributions are not normal; you can conduct normality tests to check whether that is really so.

from scipy.stats import normaltest

print(normaltest(X1), normaltest(X2))

# (
# NormaltestResult(statistic=343.810469248749, pvalue=2.2004180173005142e-75),
# NormaltestResult(statistic=8.66585727465625, pvalue=0.013129040932373876)
# )

Both tests reject the null hypothesis of normality at the 5% significance level.

The average height in group A is 165.14; a naive comparison suggests it is about 3cm greater than in group B. However, is this conclusion statistically significant? Is there any chance that we are wrong purely because of randomness? The naive approach has no answer to these questions.

Now suppose we don't trust the naive conclusion and can't use classical statistical inference (due to the non-normal distributions). Let's use the bootstrap instead.

Bootstrapped means in groups A and B

Using 100,000 samples with a sample size equal to the original size of the data, we get the distributions presented above. They are different, but since they intersect, we can't entirely rely on this difference without introducing the concept of confidence intervals.

Confidence Intervals

In statistics, a confidence interval is a range of values that, at a given confidence level, is likely to contain the true value of the quantity we are estimating.

For example, take the range of values [1, 10, 10, 10, 15, 20, 955]. The median equals 10, meaning that 50% of observations are higher than 10 and 50% are lower. If we calculate the 5% and 95% quantiles of the range, we get 3.7 and 674.5 respectively. These numbers are the boundaries of a 90% confidence interval.
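This toy calculation can be reproduced directly with `np.quantile` (the exact boundary values come from NumPy's default linear interpolation between order statistics):

```python
import numpy as np

values = np.array([1, 10, 10, 10, 15, 20, 955])

print(np.median(values))                  # 10.0
print(np.quantile(values, [0.05, 0.95]))  # approximately [3.7, 674.5]
```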

A 90% confidence interval means that we trust the values lying between the 5% rarest observations on each side of the distribution.

Let's calculate the confidence intervals for the sample means of the heights in groups A and B.

import numpy as np

# sample_size is passed as a fraction of the data size (1.0 = full size)
b1 = bootstrap(X1, 100000, 1.0, np.mean)
b2 = bootstrap(X2, 100000, 1.0, np.mean)

# 90% confidence
print(np.quantile(b1, [0.05, 0.95])) # array([164.34643921, 165.48089919])
print(np.quantile(b2, [0.05, 0.95])) # array([162.28512272, 164.9438594 ])

# 98% confidence
print(np.quantile(b1, [0.01, 0.99])) # array([164.1094731 , 165.71734362])
print(np.quantile(b2, [0.01, 0.99])) # array([161.71811551, 165.5030704 ])

The results depend on the confidence level we pick. At the 90% level, the samples' means in group A vary from 164 to 165 while in group B they range from 162 to 164, so the ranges of A and B barely intersect, except at their very edges.

Having increased the confidence level to 98%, we get the opposite result: now the samples' means in group A vary from 164 to 165 while in group B they range from 161 to 165. The intersection of the two ranges shows that the difference between the groups is not significant at this level.

The researcher should pick the confidence level based on business constraints, because every percentage point of confidence is connected to business value and to the potential losses from making wrong decisions.

Step-By-Step Guide

To sum up, you have to go through the following steps when bootstrapping:

  1. Describe your task in terms of statistics. What metric do you want to calculate? If it's the mean, variance, or something similarly simple, then statistical inference should be preferred.
  2. Check the distribution of your data. Is it governed by one of the well-studied laws (normal, Poisson, exponential, etc.)? If so, then statistical inference should be preferred.
  3. If you answered no to the first two questions, then there is a 90% chance that you should consider bootstrap as a tool for your task.
  4. Try bootstrapping with different sample sizes and numbers of samples (start with lower figures and increase them until the bootstrap converges).
  5. Calculate the confidence intervals and be critical about the conclusions you draw. Try changing the parameters and check whether your conclusions change.
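The steps above can be sketched end-to-end; the lognormal "heights" data and the thresholds here are illustrative assumptions:

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(5)
# Illustrative skewed "heights" data; the lognormal parameters are assumptions
heights = rng.lognormal(mean=5.1, sigma=0.3, size=2_000)

# Step 2: check the distribution; a p-value below 0.05 rejects normality
if normaltest(heights).pvalue < 0.05:
    # Steps 3-4: bootstrap the mean with a full-size resample
    boot = [rng.choice(heights, size=heights.shape[0], replace=True).mean()
            for _ in range(10_000)]
    # Step 5: a 90% confidence interval from the bootstrap quantiles
    low, high = np.quantile(boot, [0.05, 0.95])
    print(round(low, 2), round(high, 2))
```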