Understanding sampling distributions
Let us first assume that we have access to a population that has a normal distribution. Imagine you take a sample of, say, size 10. You can calculate various descriptive statistics for this sample, such as the mean, standard deviation (SD), median, or variance. The sample size is 10 and the number of samples is 1.
Now if you take 20 such samples, each of sample size 10 as above, we can calculate descriptive statistics such as the mean and median for each sample, just as we did for the first sample. Say we calculate the mean alone. We now have the means of 20 samples, each of which has a sample size of 10. Please do not confuse the sample size (here 10) with the number of samples (here 20 at present).
If we plot the means of the 20 samples, we have what is called a sampling distribution. This is a theoretical distribution in which the number of samples can be imagined to extend from 20, as in our case, to infinity.
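As a concrete illustration, here is a minimal sketch in Python (using NumPy; the population mean of 100 and SD of 15 are arbitrary values assumed for the example) that draws 20 samples of size 10 from a normal population and keeps the mean of each. A histogram of these 20 means is an empirical picture of the sampling distribution of the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumed (hypothetical) population: normal with mean 100 and SD 15
population_mean, population_sd = 100, 15
sample_size = 10   # size of each sample
n_samples = 20     # number of samples drawn

# Draw 20 samples of size 10 and keep the mean of each sample
sample_means = [rng.normal(population_mean, population_sd, sample_size).mean()
                for _ in range(n_samples)]

print(sample_means)  # 20 numbers; their histogram sketches the sampling distribution
```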
There are several important features of a sampling distribution.
As the number of samples increases from 20 towards infinity, the sampling distribution of the mean will approximate the normal curve. This is not surprising, as in our case we have been sampling from a population with a normal distribution. What is interesting is that even if the population does not have a normal distribution, as the sample size grows large the sampling distribution of the mean will still approximate a normal distribution. This behaviour of the sampling distribution is called the central limit theorem and is a fundamental idea in statistics.
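A minimal sketch of this behaviour, assuming an exponential population (chosen only because it is clearly non-normal and strongly right-skewed): as the sample size grows, the distribution of sample means becomes more symmetric, which shows up as a skewness drifting towards zero.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n_samples = 10_000  # many samples, to approximate the theoretical sampling distribution

# Exponential population: strongly right-skewed, certainly not normal
for sample_size in (2, 10, 50, 200):
    means = rng.exponential(scale=1.0, size=(n_samples, sample_size)).mean(axis=1)
    skewness = np.mean(((means - means.mean()) / means.std()) ** 3)
    print(f"sample size {sample_size:>3}: skewness of sample means = {skewness:+.3f}")
```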
Any statistic of the sample can have its own sampling distribution. The description above is of the sampling distribution of the mean; we can similarly have a sampling distribution of the median, a sampling distribution of the SD, and so on. Because the mean is the most mathematically tractable statistic, it is the one most often used.
Several quantities derived from the sampling distribution are used for statistical testing, but let us pause here and understand the behaviour of the sampling distribution better.
From the above discussion we can identify several variable components (a simulation sketch bringing all of them together follows this list):
The size of the sample: here 10, but it could be any practical number.
The number of samples: we began with one, then increased to 20, and theoretically this can be extended to infinity.
The particular statistic computed from each sample: we computed the mean, but we could use the median, and so on.
The distribution of the population: here we assumed it to be a normal distribution, but as the central limit theorem states, whatever the population distribution, the sampling distribution of the mean will approximate the normal distribution as the sample size increases.
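Here is a sketch, again assuming NumPy, that makes these four components explicit as parameters; the function name simulate_sampling_distribution and the population values (mean 100, SD 15) are illustrative choices, not standard names.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

def simulate_sampling_distribution(statistic, sample_size, n_samples, draw_population):
    """Return n_samples values of `statistic`, each computed on a fresh sample.

    sample_size      -- size of each individual sample (component 1)
    n_samples        -- how many samples to draw (component 2)
    statistic        -- e.g. np.mean, np.median, np.std (component 3)
    draw_population  -- callable returning one sample from the population (component 4)
    """
    return np.array([statistic(draw_population(sample_size)) for _ in range(n_samples)])

# Example: sampling distribution of the median, samples of size 10 from a normal population
medians = simulate_sampling_distribution(
    statistic=np.median,
    sample_size=10,
    n_samples=5000,
    draw_population=lambda n: rng.normal(loc=100, scale=15, size=n),
)
print(medians.mean(), medians.std())
```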
Now, this is only a theoretical exercise: in practice we do not take an infinite number of samples, not even twenty samples, but only one sample. What is in your control is only the sample size, not the number of samples; the number of samples we assume to be infinite.
Now, the sampling distribution not only gives us an assurance of a normal distribution, which is a required assumption of all parametric tests, but also provides us with a new statistic called the standard error.
The sampling distribution of the mean is a distribution of sample means. We can therefore calculate the mean and standard deviation of the sampling distribution itself. The standard deviation of the sampling distribution is called the standard error. The central limit theorem also gives us two further results:
The mean of the sampling distribution will be approximately equal to the mean of the population.
The standard error is related to the standard deviation of the population: SE = SD / sqrt(N), or equivalently SD = SE × sqrt(N). Here N is the sample size, not the number of samples.
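Both results can be checked numerically. The following sketch (assumed population: normal, mean 100, SD 15; sample size 25) simulates many sample means and compares their mean and standard deviation with the population mean and with SD / sqrt(N).

```python
import numpy as np

rng = np.random.default_rng(seed=4)

population_mean, population_sd = 100, 15   # assumed, arbitrary values
sample_size, n_samples = 25, 100_000

# Many sample means, approximating the sampling distribution of the mean
means = rng.normal(population_mean, population_sd,
                   size=(n_samples, sample_size)).mean(axis=1)

print("mean of sampling distribution:", means.mean())   # close to 100, the population mean
print("standard error (SD of means): ", means.std())    # close to SD / sqrt(N)
print("SD / sqrt(N):                 ", population_sd / np.sqrt(sample_size))  # 15 / 5 = 3
```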
Now the question is: what happens if we increase the sample size?
We will explore the effect of a change in sample size on the standard error with the help of a simulation in Excel.
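For readers who prefer code to a spreadsheet, here is a short Python sketch of the same simulation (population values again assumed, mean 100 and SD 15). It only tabulates the simulated standard error of the mean for several sample sizes; discovering the pattern is left to the exercises below.

```python
import numpy as np

rng = np.random.default_rng(seed=5)

population_sd = 15   # assumed, arbitrary
n_samples = 50_000   # number of simulated samples per sample size

for sample_size in (4, 9, 16, 25, 100):
    means = rng.normal(100, population_sd, size=(n_samples, sample_size)).mean(axis=1)
    print(f"sample size {sample_size:>3}: simulated SE of the mean = {means.std():.3f}")
```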
Determine how the standard error is affected by sample size. Plot the standard error of the mean as a function of sample size for different standard deviations. Can you discover a formula relating the standard error of the mean to the sample size and the standard deviation? If so, see whether it holds for distributions other than the normal distribution.
Redo the previous exercise for the median. Find a distribution/sample-size combination for which the sample median is a biased estimate of the population median. Is the sample variance an unbiased estimate of the population variance? If not, see if you can find a correction based on sample size. Does the correction hold for distributions other than the normal distribution?
A statistic is an unbiased estimate if the mean of its sampling distribution equals the parameter being estimated. Test whether the sample mean is an unbiased estimate of the population mean. Try out different sample sizes and distributions.
For what statistic is the mean of the sampling distribution dependent on sample size?
For a normal distribution, compare the size of the standard error of the median and the standard error of the mean. Can you find a relationship that holds (approximately) across sample sizes?
Does this relationship hold for a uniform distribution?
Find a distribution for which the standard error of the median is smaller than the standard error of the mean. (You may find this difficult, but don’t give up.)
Compare the standard error of the standard deviation and the standard error of the mean absolute deviation from the mean (MAD). Does the relationship depend on the distribution?