Standard Error & Confidence Interval
The mean is probably the most widely used point estimate, but on its own it conveys only part of the story. We would also like to know how accurate this estimate is. The standard error and the confidence interval (CI) can be used to answer this question.
§Multiple sampling
Given a population, one draws M samples of size N each and calculates the mean of each sample (over its N sample points). From these M means, one can calculate their mean and standard deviation: the mean is the estimate of the population mean, and the standard deviation is the standard error. (Yes, the standard error is a form of standard deviation.) From the same M means, one can also calculate the confidence interval with the percentile method, which basically identifies the interval containing the middle 95% of the data points.
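The procedure above can be sketched in a few lines of Octave. This is only an illustrative sketch, not the original script: the N(0, 1) population, the values of M and N, and all variable names are assumptions chosen for the example. The percentile method is done by sorting and indexing, so no toolbox function is needed.

```octave
% Illustrative sketch; population, M, N and variable names are assumptions.
M = 10000;                        % number of samples
N = 100;                          % points per sample

samples = randn(N, M);            % each column is one sample drawn from N(0, 1)
means   = mean(samples, 1);       % 1-by-M vector of sample means

mean_estimate  = mean(means);     % estimate of the population mean
standard_error = std(means);      % standard deviation of the M means

% Percentile method: keep the middle 95% of the sorted means.
sorted_means = sort(means);
ci_low  = sorted_means(round(0.025 * M));
ci_high = sorted_means(round(0.975 * M));

fprintf('mean = %.3f, SE = %.3f, 95%% CI = [%.3f, %.3f]\n', ...
        mean_estimate, standard_error, ci_low, ci_high);
```

With these settings the standard error should come out close to 0.1 and the interval should straddle 0, although the exact numbers change from run to run.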
§Bootstrapping single sample
Drawing M separate samples can be too expensive to be practical, though, so one alternative is bootstrapping: simulate sampling M times by resampling, with replacement, N points from a single sample of size N. Once we have the M resamples, we can follow the same procedure as above.
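A minimal bootstrap sketch in Octave follows, again under assumed settings (N(0, 1) population, M = 10000, N = 100, hypothetical variable names). The only change from multiple sampling is that every resample is drawn, with replacement, from the one sample we actually have.

```octave
% Illustrative sketch; population, M, N and variable names are assumptions.
M = 10000;                        % number of bootstrap resamples
N = 100;                          % size of the single observed sample

sample = randn(N, 1);             % the only sample we can afford to draw

idx = randi(N, N, M);             % N-by-M indices in 1..N, drawn with replacement
boot_means = mean(sample(idx), 1);   % mean of each resample (one per column)

mean_estimate  = mean(boot_means);
standard_error = std(boot_means);

% Percentile method on the bootstrap means.
sorted_means = sort(boot_means);
ci_low  = sorted_means(round(0.025 * M));
ci_high = sorted_means(round(0.975 * M));

fprintf('mean = %.3f, SE = %.3f, 95%% CI = [%.3f, %.3f]\n', ...
        mean_estimate, standard_error, ci_low, ci_high);
```

Note that the bootstrap interval is centered near the mean of the observed sample rather than near the true mean, so it shifts with whichever sample was drawn, while its width stays close to the multiple-sampling result.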
§Matlab/Octave Code
Code for the two approaches is shown below. The population follows the standard normal distribution N(0, 1), so std_1, the standard deviation of the first sample, is ~1. For samples of size 100, the standard error is ~0.1; for samples of size 400, it is ~0.05. This relation is not accidental: the standard error scales as 1/sqrt(N). Since both the standard error and the confidence interval reflect the accuracy of the estimate, they share the same relation: the CI length scales as 1/sqrt(N) as well. The intuition is that the standard error and the CI track how far the estimate is likely to be from the true value, and a larger sample size (N) gives us a better estimate, i.e. a smaller standard error and a narrower CI.
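The 1/sqrt(N) relation can be written out explicitly. For N independent draws from a population with standard deviation σ (here σ = 1), the standard error of the mean works out as:

```latex
\mathrm{SE} = \frac{\sigma}{\sqrt{N}}, \qquad
N = 100 \;\Rightarrow\; \mathrm{SE} = \tfrac{1}{10} = 0.1, \qquad
N = 400 \;\Rightarrow\; \mathrm{SE} = \tfrac{1}{20} = 0.05
```

Quadrupling N therefore halves the standard error. Assuming the sample means are approximately normal, the 95% CI is roughly the mean ± 1.96 SE, so its length (about 3.92 SE) shrinks by the same factor.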
% just to convince octave that this is a local function
Output is:
multiple sampling
§Quiz: How about M? Does increasing M affect the standard error/CI?
Yes and no. Increasing M means we use more samples, so we should get a better estimate, but an estimate of what? A larger M gives us a better estimate of the estimate of the true value, that is, more stable values for the mean of means, the standard error and the CI, rather than a better estimate of the true value itself. The latter is what we are after, and its accuracy is tracked by the standard error, which depends on N and not on M.
In the script above, we just used M=10000, much larger than N, to ensure the estimate of the estimate is good enough.
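One way to check this is to hold N fixed and vary M. The sketch below is an illustration under assumed settings (N(0, 1) population, N = 100, three arbitrary values of M, hypothetical variable names), not part of the original script.

```octave
% Illustrative sketch; population, N, the M values and names are assumptions.
N  = 100;
Ms = [100, 1000, 10000];
se = zeros(size(Ms));

for k = 1:numel(Ms)
  means = mean(randn(N, Ms(k)), 1);   % Ms(k) sample means, each over N points
  se(k) = std(means);                 % standard error measured with this M
  fprintf('M = %5d: SE = %.4f\n', Ms(k), se(k));
end
```

The standard error should stay near 0.1 for every M. What a larger M buys is less scatter in that number from one run to the next.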