Demonstrate understanding of the sampling distribution of a statistic.
Explain how the central limit theorem applies in inference.
Determine whether a sampling distribution is approximately a normal distribution.
Calculate key characteristics (mean, standard error) of the sampling distribution of a statistic.
Estimate the probability of an event using the sampling distribution.
When a sample statistic is used to estimate a population parameter, there will be a chance error: $$\text{Population Parameter}=\text{Sample Statistic}+\text{Chance Error}.$$
To understand the chance error, we need to know how sample statistics are distributed. Consider samples of the same size \(n\) randomly chosen from the population with replacement.
The probability distribution of a sample statistic is called a sampling distribution.
The sampling distribution varies as the sample size changes. In general, a larger sample size results in a smaller standard deviation of the sampling distribution.
The standard deviation of a sampling distribution is also called the standard error.
As the sample size \(n\) increases, the sampling distribution of the sample mean, from a population with the mean \(\mu\) and the standard deviation \(\sigma\), will approach a normal distribution with the mean \(\mu_{\bar{X}}=\mu\) and the standard deviation \(\sigma_{\bar{X}}=\dfrac{\sigma}{\sqrt{n}}\).
In terms of standardization, the central limit theorem says that the random variable \(\bar{Z}=\dfrac{\bar{X}-\mu}{\sigma/\sqrt{n}}\) has an approximately standard normal distribution.
For most distributions (not highly skewed), when the sample size \(n>30\), the sampling distribution of the sample mean \(\bar{X}\) can be approximated reasonably well by a normal distribution. The larger the sample size, the better the approximation will be.
When the population is normally distributed, the sampling distribution of the sample mean will be normally distributed for any sample size.
If the population distribution is highly skewed, relying on the CLT with a moderate sample size can be risky.
See https://stats.stackexchange.com/questions/3734 for a discussion of intuitive explanations of the central limit theorem.
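The central limit theorem can also be explored by simulation. The following Python sketch is an illustration only: the exponential population (mean 1, standard deviation 1), the sample size 40, the number of repetitions, and the seed are all assumptions chosen for the demonstration, not values from these notes.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the run is repeatable

n = 40              # sample size (assumed for illustration)
num_samples = 5000  # number of repeated samples (assumed)

# Right-skewed population: exponential with rate 1, so mu = 1 and sigma = 1.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)  # should be close to mu = 1
sd_of_means = statistics.pstdev(sample_means)   # should be close to sigma / sqrt(n)
theoretical_se = 1.0 / n ** 0.5

print(mean_of_means, sd_of_means, theoretical_se)
```

The simulated standard deviation of the sample means should land near \(\sigma/\sqrt{n}\), even though the population itself is far from normal.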
Randomly draw samples of size 2 with replacement from the numbers 1, 3, 4.
Solution: Using the Excel function AVERAGE(), we may find the means of the samples and the mean of the sample means.
Using the Excel function STDEV.P(), we may find the standard deviation of the population and the standard deviation of the sample means.
\(\color{red}{\mu}\) | \(\color{red}{\sigma}\) | \(\color{blue}{\mu_{\bar{X}}}\) | \(\color{blue}{\sigma_{\bar{X}}}\) |
---|---|---|---|
2.7 | 1.25 | 2.7 | 0.88 |
sample | (1,1) | (1,3) | (1,4) | (3,1) | (3,3) | (3,4) | (4,1) | (4,3) | (4,4) |
---|---|---|---|---|---|---|---|---|---|
\(\bar{X}\) | 1 | 2 | 2.5 | 2 | 3 | 3.5 | 2.5 | 3.5 | 4 |
It can be verified that \(\mu_{\bar{X}}=\mu\) and \(\sigma/\sqrt{n}=1.25/\sqrt{2}\approx 0.88=\sigma_{\bar{X}}\).
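The same check can be done outside Excel. Here is a minimal Python sketch that enumerates all nine ordered samples of size 2 from the numbers 1, 3, 4; `fmean` and `pstdev` play the roles of AVERAGE() and STDEV.P(), and the variable names are illustrative.

```python
from itertools import product
from statistics import fmean, pstdev

population = [1, 3, 4]
n = 2

# All 3**2 = 9 equally likely ordered samples drawn with replacement.
samples = list(product(population, repeat=n))
sample_means = [fmean(s) for s in samples]

mu = fmean(population)             # population mean
sigma = pstdev(population)         # population standard deviation
mu_xbar = fmean(sample_means)      # mean of the sample means
sigma_xbar = pstdev(sample_means)  # standard deviation of the sample means

print(round(mu, 2), round(sigma, 2), round(mu_xbar, 2), round(sigma_xbar, 2))
print(abs(sigma_xbar - sigma / n ** 0.5) < 1e-12)  # standard error identity
```

Because every possible sample is listed, the identities \(\mu_{\bar{X}}=\mu\) and \(\sigma_{\bar{X}}=\sigma/\sqrt{n}\) hold exactly here, up to floating-point rounding.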
The following are the distribution of the population and the distribution of sample means.
Suppose the mean length of time that a caller is placed on hold when telephoning a customer service center is 23.8 seconds, with standard deviation 4.6 seconds. Find the probability that the mean length of time on hold in a random sample of 1,000 calls will be within 0.5 second of the population mean.
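Solution: Since the sample size \(n=1000>30\) is large enough, by the Central Limit Theorem the sample mean length of time \(\bar{X}\) is approximately normally distributed with mean \(\mu_{\bar{X}}=23.8\) and standard deviation \(\sigma_{\bar{X}}=4.6/\sqrt{1000}\). The probability \(P(23.3<\bar{X}<24.3)\) can be obtained by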
NORM.DIST(24.3, 23.8, 4.6/SQRT(1000),TRUE)-NORM.DIST(23.3, 23.8, 4.6/SQRT(1000),TRUE)
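For readers working outside Excel, the same calculation can be sketched in Python. `NormalDist(...).cdf(x)` from the standard library is used here as a stand-in for NORM.DIST(x, mean, sd, TRUE); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 23.8, 4.6, 1000

# Sampling distribution of the sample mean: normal with standard error sigma / sqrt(n).
xbar_dist = NormalDist(mu, sigma / sqrt(n))

# P(23.3 < sample mean < 24.3), i.e. within 0.5 second of the population mean.
prob_within_half_second = xbar_dist.cdf(24.3) - xbar_dist.cdf(23.3)
print(round(prob_within_half_second, 4))
```

With a standard error of only about 0.145 seconds, half a second is more than three standard errors, so the probability is very close to 1.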
Suppose speeds of vehicles on a particular stretch of roadway are normally distributed with mean 36.6 mph and standard deviation 1.7 mph.
Solution: Since the population is normally distributed with \(\mu=36.6\) and \(\sigma=1.7\), the sampling distribution of the sample mean for samples of size \(n=10\) is also normally distributed, with \(\mu_{\bar{X}}=\mu=36.6\) and \(\sigma_{\bar{X}}=\sigma/\sqrt{n}=1.7/\sqrt{10}\).
The probability that the speed of a vehicle is between 35 and 40 is \(P(35< X< 40)=P(X< 40)-P(X<35)\approx 0.8039\), which can be obtained by NORM.DIST(40, 36.6, 1.7, TRUE)-NORM.DIST(35, 36.6, 1.7, TRUE).
The probability of getting a sample of size 10 with the mean between 35 and 40 is \(P(35<\bar{X}< 40)=P(\bar{X}< 40)-P(\bar{X}<35)\approx 0.9985\), which can be obtained by NORM.DIST(40, 36.6, 1.7/SQRT(10), TRUE)-NORM.DIST(35, 36.6, 1.7/SQRT(10), TRUE).
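The contrast between a single vehicle and a sample mean can be seen side by side in a short Python sketch. It assumes `statistics.NormalDist` as an equivalent of NORM.DIST with the cumulative flag set to TRUE; the helper variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 36.6, 1.7, 10

single_vehicle = NormalDist(mu, sigma)         # distribution of one speed X
sample_mean = NormalDist(mu, sigma / sqrt(n))  # distribution of the mean of 10 speeds

p_single = single_vehicle.cdf(40) - single_vehicle.cdf(35)
p_mean = sample_mean.cdf(40) - sample_mean.cdf(35)

print(round(p_single, 4), round(p_mean, 4))
```

The only change between the two lines is the standard deviation: dividing by \(\sqrt{10}\) concentrates the sample mean around 36.6, which is why the second probability is larger.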
The proportion of a specific characteristic in a data set can be viewed as the mean of the data set by identifying the specific characteristic with 1 and others with \(0\).
Example: Consider the following data set
1, 0, 1, 1, 0, 0, 1, 0, 1, 1
The proportion of 1s is \(\frac{6}{10}=0.6\), which is the same as the mean of the data set: \(\frac{6\cdot 1 + 4\cdot 0}{10}=0.6\).
Consider a population consisting of 1s and 0s. Let \(p\) be the proportion of 1s. Then the population mean is \(p\) and the population standard deviation is $$\sigma=\sqrt{(1-p)^2p+(0-p)^2(1-p)}=\sqrt{p(1-p)}.$$
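The formula \(\sigma=\sqrt{p(1-p)}\) can be checked numerically. The sketch below treats the ten-value data set 1, 0, 1, 1, 0, 0, 1, 0, 1, 1 as an entire population, which is an assumption made only for this illustration.

```python
from math import sqrt
from statistics import fmean, pstdev

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

p = fmean(data)                  # proportion of 1s equals the mean of the data
sd_direct = pstdev(data)         # population standard deviation, like STDEV.P()
sd_formula = sqrt(p * (1 - p))   # shortcut formula for 0/1 data

print(p, sd_direct, sd_formula)
```

Both standard deviations agree up to floating-point rounding, so for 0/1 data the proportion alone determines the spread.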
For a sampling distribution of sample proportion, we write \(\hat{P}\) for the random variable of sample proportions.
For large samples, the distribution of sample proportions \(\hat{P}\) is approximately normal, with the mean \(\mu_{\hat{P}}=p\) and standard deviation \(\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}\), where \(p\) is the population proportion.
A sample proportion is always between 0 and 1, and under a normal distribution about 99.7% of sample proportions lie within 3 standard deviations of the population proportion. Therefore, when using the central limit theorem for proportions, we require the sample size \(n\) to satisfy the following condition: the interval \(\left[p-3\sqrt{\frac{p(1-p)}{n}}, p+3\sqrt{\frac{p(1-p)}{n}}\right]\) lies wholly in the interval \([0, 1]\).
In practice, if \(n\) satisfies the two inequalities \(np\ge 10\) and \(n(1-p)\ge 10\), then we consider \(n\) large enough to assume that the sampling distribution of the sample proportion is approximately normal.
When the population proportion \(p\) is unknown, to apply the central limit theorem for proportions, we require the sample size \(n\) to satisfy the same conditions with \(p\) replaced by the sample proportion \(\hat{p}\). That is, the sample size \(n\) should satisfy \(n\hat{p}\ge 10\) and \(n(1-\hat{p})\ge 10\).
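Both sample-size checks are simple arithmetic and can be sketched as small Python helpers. The function names `normal_approx_ok` and `three_sd_interval` are hypothetical, introduced here only for illustration, and the example values \(n=30\), \(p=0.9\) are assumed.

```python
from math import sqrt

def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb: both n*p and n*(1-p) should be at least the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

def three_sd_interval(n, p):
    """Interval p +/- 3 standard errors; it should lie wholly inside [0, 1]."""
    se = sqrt(p * (1 - p) / n)
    return p - 3 * se, p + 3 * se

# A large sample with p near one half passes both checks.
print(normal_approx_ok(900, 0.53), three_sd_interval(900, 0.53))

# A small sample with p near 1 fails: the upper endpoint exceeds 1.
print(normal_approx_ok(30, 0.9), three_sd_interval(30, 0.9))
```

When \(p\) is unknown, the same helpers can be called with the sample proportion \(\hat{p}\) in place of \(p\).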
Suppose that in a population of voters in a certain region 53% are in favor of a particular law. Nine hundred randomly selected voters are asked if they favor the law.
Find the probability that the sample proportion computed from a random sample of size 900 will be at least 2% above the true population proportion.
Solution: We first verify that the sampling distribution is approximately normal.
Since \(p=0.53\) and \(n=900\), \(np=900\cdot 0.53>10\) and \(n(1-p)=900(1-0.53)>10\). By the central limit theorem, the sampling distribution is approximately normal.
The standard deviation of the sampling distribution is \(\sigma_{\hat{P}}=\sqrt{\frac{0.53(1-0.53)}{900}}\approx 0.017\).
Then the probability that the random sample has a proportion at least 2% above 53% is
\(P(\hat{P}>0.55)=1-P(\hat{P}\le 0.55)\approx 0.115\),
which can be obtained, using the unrounded standard deviation, by 1-NORM.DIST(0.55, 0.53, SQRT(0.53*(1-0.53)/900),TRUE).
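This tail probability is sensitive to how the standard deviation is rounded, which a short Python sketch makes visible. It assumes `statistics.NormalDist` as a stand-in for NORM.DIST; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.53, 900
se = sqrt(p * (1 - p) / n)  # unrounded standard error, about 0.0166

# P(sample proportion > 0.55) with the unrounded standard error.
prob_exact_se = 1 - NormalDist(p, se).cdf(0.55)

# Same probability if the standard error is first rounded to 0.017.
prob_rounded_se = 1 - NormalDist(p, 0.017).cdf(0.55)

print(round(prob_exact_se, 3), round(prob_rounded_se, 3))
```

The unrounded standard error gives about 0.115, while rounding it to 0.017 first gives about 0.120, so it is safer to keep full precision inside the formula.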
Suppose that 36% of all car accidents involve injury. Find the probability that the injury rate in a random sample of 250 car accidents is between 30% and 45%.
Solution: The injury rate of all car accidents is \(p=36\%=0.36\) and the sample size is \(n=250\). Because \(np=250\cdot 0.36=90>10\) and \(n(1-p)=250-90=160>10\), the sample size is considered large enough. By the Central Limit Theorem, the sample proportion \(\hat{P}\) is approximately normally distributed with the mean \(\mu_{\hat{P}}=p=0.36\) and standard deviation \(\sigma_{\hat{P}}=\sqrt{\frac{p(1-p)}{n}}\approx 0.03\).
Then the probability of a random sample of 250 car accidents with the injury rate between 30% and 45% is
\(P(0.30<\hat{P}<0.45)=P(\hat{P}<0.45)-P(\hat{P}<0.30)\approx 0.976\),
which can be obtained by NORM.DIST(45%, 36%, 0.03, TRUE)-NORM.DIST(30%, 36%, 0.03, TRUE).
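As a cross-check, the interval probability can be sketched in Python, taking the upper cumulative probability minus the lower one. `statistics.NormalDist` is assumed as a stand-in for NORM.DIST, and the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.36, 250
se = sqrt(p * (1 - p) / n)  # about 0.0304 before rounding

# Using the standard deviation rounded to 0.03, as in the solution above.
rounded = NormalDist(p, 0.03)
prob_rounded_se = rounded.cdf(0.45) - rounded.cdf(0.30)

# Using the unrounded standard error.
exact = NormalDist(p, se)
prob_exact_se = exact.cdf(0.45) - exact.cdf(0.30)

print(round(prob_rounded_se, 3), round(prob_exact_se, 3))
```

The order of subtraction matters: the cumulative probability at the upper bound comes first, otherwise the result is negative.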
The numerical population of grade point averages at a college has mean 2.61 and standard deviation 0.5. If a random sample of size 100 is taken from the population, what is the probability that the sample mean will be between 2.51 and 2.71?
More Practice
A population has mean 73.5 and standard deviation 2.5.
A normally distributed population has mean 57.7 and standard deviation 12.1.
Suppose the mean amount of cholesterol in eggs labeled “large” is 186 milligrams, with standard deviation 7 milligrams. Find the probability that the mean amount of cholesterol in a sample of 144 eggs will be within 2 milligrams of the population mean.
Suppose that 8% of all males suffer some form of color blindness. Find the probability that in a random sample of 250 men at least 10% will suffer some form of color blindness.
In a mayoral election, based on a poll, a newspaper reported that the current mayor received 45% of the vote. If this is true, what is the probability that a random sample of 100 voters had less than 35% voting for the current mayor?
Lab Instructions in Excel
NORM.DIST() Function
Let \(X\) be a normal random variable with mean \(\mu\) and standard deviation \(\sigma\), that is, \(X\sim \mathcal{N}(\mu, \sigma^2)\). In Excel, \(P(X<x)\) is given by NORM.DIST(x, mean, sd, TRUE).
Recall that the mean of a data set can be obtained by the Excel function AVERAGE().
Given the population mean \(\mu\) and standard deviation \(\sigma\), if the sample size \(n\) is bigger than 30 and the sample mean is \(\bar{x}\), then the probability of getting another sample of the same size with a smaller mean can be obtained by the following Excel function: NORM.DIST(\(\bar{x}\), \(\mu\), \(\sigma\)/SQRT(n), TRUE).
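For comparison, the same recipe can be written as a small Python function. The name `prob_sample_mean_below` is hypothetical, and `statistics.NormalDist` is assumed as an equivalent of NORM.DIST with the cumulative flag set to TRUE; the usage values are taken from the hold-time example earlier in these notes.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_below(xbar, mu, sigma, n):
    """P(sample mean < xbar) under the normal approximation with standard error sigma/sqrt(n)."""
    return NormalDist(mu, sigma / sqrt(n)).cdf(xbar)

# Usage with the hold-time example: mu = 23.8, sigma = 4.6, n = 1000.
below_mean = prob_sample_mean_below(23.8, 23.8, 4.6, 1000)
within_half = (prob_sample_mean_below(24.3, 23.8, 4.6, 1000)
               - prob_sample_mean_below(23.3, 23.8, 4.6, 1000))

print(below_mean, round(within_half, 4))
```

Probabilities for an interval follow by subtracting two calls, exactly as with two NORM.DIST terms in Excel.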
An airline claims that 72% of all its flights to a certain region arrive on time. In a random sample of 30 recent arrivals, 19 were on time. You may assume that the normal distribution applies.