Create and interpret boxplots as a means of summarizing non-symmetric data.
Calculate and explain the purpose of measures of center (mean, median) and measures of variability (standard deviation, interquartile range).
Explain the impact of outliers on summary statistics such as mean, median and standard deviation.
Find the median, quartiles, IQR and outliers (if they exist) for the heights of a sample of 15 trees.
70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75
Solution:
Sort the data set from small to large.
63, 65, 66, 69, 70, 72, 75, 75, 75, 76, 76, 79, 80, 81, 83
Find the median \(Q_2\): The sample size is 15. The \(Q_2\) is the 8-th value 75 in the sorted data.
Find \(Q_1\) and \(Q_3\): \(Q_1\) is the median of the lower half of the values (the 7 values below the median position), that is, the 4-th value, 69. \(Q_3\) is the median of the upper half of the values, that is, the 4-th value from the end, 79.
Find the IQR: \(\text{IQR}=Q_3-Q_1=79-69=10\).
Since \(Q_1-1.5\text{IQR}=69-1.5\cdot 10=54\) and \(Q_3+1.5\text{IQR}=79+1.5 \cdot 10=94\), there is no outlier in this sample.
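The steps above can be sketched in Python. This is a minimal sketch that assumes the median-of-halves convention used in this solution (the middle value is excluded from both halves when the sample size is odd); the helper name `quartiles` is illustrative only.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]          # values before the median position
    upper = data[(n + 1) // 2:]     # values after it (middle value skipped when n is odd)
    return median(lower), median(data), median(upper)


heights = [70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75]

q1, q2, q3 = quartiles(heights)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in heights if x < low_fence or x > high_fence]

print(q1, q2, q3, iqr)          # 69 75 79 10
print(low_fence, high_fence)    # 54.0 94.0
print(outliers)                 # []
```

Other software may use a different quartile convention, so its \(Q_1\) and \(Q_3\) can differ slightly from the values found by hand.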
A box plot shows a “five-number summary” of the data set. It contains a box, two whiskers and dots (for outliers).
To create the boxplot for a distribution,
Create the boxplot for the ages of 32 Best Actor Oscar winners (1970–2001).
31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42, 43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76
Solution: You may use Excel to find the five-number summary.
The quartiles are \(Q_2=42.5\), \(Q_1=37.5\), \(Q_3=49.5\).
The interquartile range is \(\text{IQR}=12\), and the bounds for outliers are \(Q_1-1.5\text{IQR}= 19.5\) and \(Q_3+1.5\text{IQR}=67.5\).
The smallest number that is not an outlier is 31. The largest number that is not an outlier is 61. Those two numbers bound the whiskers.
The number 76 is a mild outlier because \(Q_3+1.5\text{IQR}< 76 < Q_3+3\text{IQR}.\)
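As a cross-check of the numbers in this solution, the following Python sketch computes the pieces of the boxplot. It assumes the same median-of-halves quartiles as above and the usual fences at 1.5 and 3 times the IQR; the variable names are illustrative.

```python
from statistics import median

ages = sorted([31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42,
               43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76])
n = len(ages)

# Quartiles: medians of the lower half, the whole sample, and the upper half.
q1 = median(ages[: n // 2])
q2 = median(ages)
q3 = median(ages[(n + 1) // 2:])
iqr = q3 - q1

# Inner fences separate outliers; outer fences separate mild from extreme outliers.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

# Whiskers end at the most extreme values that are not outliers.
inside = [x for x in ages if inner_low <= x <= inner_high]
whisker_low, whisker_high = min(inside), max(inside)

mild = [x for x in ages
        if (outer_low <= x < inner_low) or (inner_high < x <= outer_high)]
extreme = [x for x in ages if x < outer_low or x > outer_high]

print(q1, q2, q3, iqr)              # 37.5 42.5 49.5 12.0
print(inner_low, inner_high)        # 19.5 67.5
print(whisker_low, whisker_high)    # 31 61
print(mild, extreme)                # [76] []
```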
Sigma notation: in math, we denote the sum of values \(x_1\), \(x_2\), \(\dots\), \(x_n\) of a variable \(x\) by \(\sum\limits_{i=1}^n x_i\) or simply by \(\sum x\).
The population mean is \(\mu= \frac{\sum x}{N}\), where \(N\) is the population size, i.e., the number of elements in the population.
The notation \(\mu\) reads as mu.
The sample mean is \(\bar{x}=\frac{\sum{x}}{n}\), where \(n\) is the sample size. The notation \(\bar{x}\) reads as \(x\)-bar.
Find the mean city mpg for a sample of 10 cars.
18, 21, 20, 21, 16, 18, 18, 18, 16, 20
Solution: The mean is
$$\bar{x}=\frac{18+21+20+21+16+18+18+18+16+20}{10}=18.6.$$
The mean mpg of the 10 cars is 18.6 mpg.
In Excel, if the data are in the cells A1:A10, you may use the function =AVERAGE(A1:A10) to find the mean.
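Outside Excel, the same calculation can be sketched in Python. The snippet below computes the sample mean directly from the definition \(\bar{x}=\frac{\sum x}{n}\) and compares it with the standard library function `statistics.mean`.

```python
from statistics import mean

mpg = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]

# Sample mean from the definition: sum of the values divided by the sample size.
x_bar = sum(mpg) / len(mpg)

print(x_bar)       # 18.6
print(mean(mpg))   # 18.6
```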
The weighted mean of a set of numbers \(\{x_1, \dots, x_n\}\) with weights \(w_1\), \(w_2\), …, \(w_n\) is defined as $$\frac{\sum w_ix_i}{\sum w_i}.$$
The mean of a frequency table is the weighted mean \(\bar{x}=\frac{\sum f x}{n}\), where \(x\) is a value with frequency \(f\) and \(n=\sum f\) is the sample size.
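To illustrate the frequency-table formula, the sketch below builds a frequency table from the city mpg sample used earlier and recovers the same mean. Building the table with `collections.Counter` is just one convenient choice.

```python
from collections import Counter

mpg = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]

# Frequency table: each distinct value x mapped to its frequency f.
freq = Counter(mpg)

n = sum(freq.values())                                 # sample size, the sum of f
x_bar = sum(f * x for x, f in freq.items()) / n        # sum of f*x divided by n

print(sorted(freq.items()))   # [(16, 2), (18, 4), (20, 2), (21, 2)]
print(x_bar)                  # 18.6
```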
In a course, the overall grade is determined in the following way: the homework average counts for 10%, the quiz average counts for 10%, the test average counts for 50%, and the final exam counts for 30%. What is the overall grade of a student who earned 92 on homework, 95 on quizzes, 90 on tests and 93 on the final exam?
Solution: The overall grade is the weighted mean
$$\frac{\sum w_ix_i}{\sum w_i}=\frac{0.1\cdot 92+0.1\cdot 95+0.5\cdot 90+0.3\cdot 93}{0.1+0.1+0.5+0.3}=91.6.$$
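The same weighted mean can be checked with a short Python sketch. The function name `weighted_mean` is hypothetical, chosen only for this illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


scores = [92, 95, 90, 93]             # homework, quizzes, tests, final exam
weights = [0.10, 0.10, 0.50, 0.30]    # share of the overall grade

overall = weighted_mean(scores, weights)
print(round(overall, 1))   # 91.6
```

Dividing by the sum of the weights means the weights do not have to add up to 1; for example, the weights 10, 10, 50, 30 give the same result.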
Find the average petal width for a sample of 10 iris flowers.
0.2, 2.1, 0.2, 1.7, 2.3, 0.3, 1.2, 0.2, 1.8, 2.3
Find the mean from the dot plot of sepal length for a sample of 10 iris flowers.
Note: To measure the spread, one may also use the mean absolute deviation \(MAD=\dfrac{\sum |x-\bar{x}|}{n}\). However, the standard deviation has better properties in applications.
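For a concrete comparison of the two measures of spread, the sketch below computes the mean absolute deviation and the sample standard deviation for the city mpg sample from the earlier example. It assumes the sample standard deviation (divisor \(n-1\)), which is what `statistics.stdev` returns.

```python
from statistics import mean, stdev

mpg = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]

x_bar = mean(mpg)

# Mean absolute deviation: average distance of the values from the mean.
mad = sum(abs(x - x_bar) for x in mpg) / len(mpg)

# Sample standard deviation (divides the sum of squared deviations by n - 1).
s = stdev(mpg)

print(round(mad, 2))   # 1.52
print(round(s, 2))     # 1.84
```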
Find the mean and standard deviation of the ages of a sample of 32 Best Actor Oscar winners (1970–2001).
31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42, 43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76
Solution:
We use the Excel functions AVERAGE() and STDEV.S() to find the mean and the sample standard deviation, respectively.
The mean is 44.7. The sample standard deviation is 10.3.
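These values can be reproduced without Excel. The Python sketch below uses `statistics.mean` and `statistics.stdev` (the sample standard deviation) and also computes the standard deviation from the formula \(s=\sqrt{\frac{\sum (x-\bar{x})^2}{n-1}}\) as a check.

```python
from math import sqrt
from statistics import mean, stdev

ages = [31, 32, 32, 33, 35, 36, 37, 37, 38, 38, 39, 40, 40, 40, 42, 42,
        43, 43, 45, 45, 46, 47, 48, 48, 51, 55, 55, 56, 60, 60, 61, 76]

x_bar = mean(ages)
s = stdev(ages)   # sample standard deviation, divisor n - 1

# The same quantity written out from the formula.
n = len(ages)
s_manual = sqrt(sum((x - x_bar) ** 2 for x in ages) / (n - 1))

print(round(x_bar, 1))      # 44.7
print(round(s, 1))          # 10.3
print(round(s_manual, 1))   # 10.3
```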
The GPAs of a sample of ten students randomly chosen from a college are recorded as follows.
1.90, 3.00, 2.53, 3.71, 2.12, 1.76, 2.71, 1.39, 4.00, 3.33
Find the standard deviation of this sample.
When we increase every value in a data set by a fixed number \(c\), the mean increases by \(c\), but the standard deviation does not change.
When we multiply every value in a data set by a factor \(k\), the mean is multiplied by \(k\) and the standard deviation is multiplied by \(|k|\).
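Both properties can be checked numerically. The sketch below reuses the city mpg sample from the earlier example; the shift `c = 10` and the factor `k = 2` are arbitrary values chosen for illustration.

```python
from math import isclose
from statistics import mean, stdev

mpg = [18, 21, 20, 21, 16, 18, 18, 18, 16, 20]
c = 10   # hypothetical shift added to every value
k = 2    # hypothetical positive factor multiplying every value

shifted = [x + c for x in mpg]
scaled = [k * x for x in mpg]

# Shifting moves the mean by c and leaves the standard deviation unchanged.
print(isclose(mean(shifted), mean(mpg) + c))     # True
print(isclose(stdev(shifted), stdev(mpg)))       # True

# Scaling by a positive k multiplies both the mean and the standard deviation by k.
print(isclose(mean(scaled), k * mean(mpg)))      # True
print(isclose(stdev(scaled), k * stdev(mpg)))    # True
```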
A sample of the highest temperatures on 10 days has a standard deviation of \(5^\circ\mathrm{C}\).
If we want to know the standard deviation in Fahrenheit, do we need to recalculate it from the sample?
What is the standard deviation in Fahrenheit?
When comparing variables that are measured in different units or on different scales, it is better to use standardized values. Converting a value to a standard value is called standardization. Standardization does not change the shape of the distribution of the data set.
Let \(x\) be a data value, \(\bar{x}\) the mean, and \(s\) the standard deviation of a data set. The standard value (or \(z\)-score) of \(x\) is defined as $$z = \frac{x-\bar{x}}{s}, \quad\text{equivalently}\quad x=zs+\bar{x}.$$
In Excel, the \(z\)-score can be obtained by =STANDARDIZE(x, mean, sd).
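The two forms of the formula can be sketched in Python as follows. The function names `z_score` and `from_z` are hypothetical, and the inputs (a mean of 70, a standard deviation of 15, a score of 85 and a \(z\)-score of \(-2\)) are illustrative values only.

```python
def z_score(x, mean, sd):
    """Standardize x: z = (x - mean) / sd."""
    if sd <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mean) / sd


def from_z(z, mean, sd):
    """Convert a z-score back to a data value: x = z * sd + mean."""
    return z * sd + mean


# Illustrative inputs, not taken from a real data set.
print(z_score(85, 70, 15))   # 1.0, one standard deviation above the mean
print(from_z(-2, 70, 15))    # 40, two standard deviations below the mean
```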
Consider a data set of scores on a standardized test with a mean 70 and standard deviation of 15.
Solution:
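If a data set has an approximately bell-shaped distribution, then approximately 68% of the data lie within one standard deviation of the mean, approximately 95% lie within two standard deviations of the mean, and approximately 99.7% lie within three standard deviations of the mean.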
For any numerical data set, at least \(1-1/k^2\) of the data lie within \(k\) standard deviations of the mean, where \(k\) is any positive whole number that is at least 2.
A population data set with a bell-shaped distribution has mean \(\mu = 6\) and standard deviation \(\sigma = 2\). Find the approximate proportion of observations in the data set that lie:
Solution:
By the Empirical Rule, approximately 68% of the data lie between 6-2=4 and 6+2=8. Since the distribution is symmetric, approximately 34% of the data lie between 4 and 6, and approximately 34% lie between 6 and 8. Because half of the data lie below the mean 6, approximately 50%-34%=16% of the data lie below 4.
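The Empirical Rule percentages are rounded values for a normal distribution. As a rough check, the sketch below assumes the population is exactly normal with mean 6 and standard deviation 2, which is an idealization of "bell-shaped", and uses `statistics.NormalDist` (Python 3.8 or later).

```python
from statistics import NormalDist

# Assumption: an exactly normal model with the given mean and standard deviation.
model = NormalDist(mu=6, sigma=2)

within_one_sd = model.cdf(8) - model.cdf(4)    # proportion between 4 and 8
between_4_and_6 = model.cdf(6) - model.cdf(4)  # proportion between 4 and 6
below_4 = model.cdf(4)                         # proportion below 4

print(round(within_one_sd, 4))     # 0.6827
print(round(between_4_and_6, 4))   # 0.3413
print(round(below_4, 4))           # 0.1587
```

Rounded to whole percentages, these proportions are 68%, 34% and 16%.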
A sample data set has mean \(\bar{x}=6\) and standard deviation \(s = 2\). Find the minimum proportion of observations in the data set that must lie between 2 and 10.
Solution:
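Since \(\bar{x}-2s=2\) and \(\bar{x}+2s=10\), the interval from 2 to 10 covers \(k=2\) standard deviations on each side of the mean. By Chebyshev’s theorem, at least \(1-1/2^2=75\%\) of the data lie between 2 and 10.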
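The Chebyshev bound is simple to compute. In the sketch below, the function name `chebyshev_lower_bound` is hypothetical; it takes the number of standard deviations \(k\) and returns the guaranteed minimum proportion \(1-1/k^2\).

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k greater than 1")
    return 1 - 1 / k ** 2


x_bar, s = 6, 2
low, high = 2, 10

# The interval is symmetric about the mean, so k is the half-width in units of s.
k = (high - x_bar) / s
print(k)                          # 2.0
print(chebyshev_lower_bound(k))   # 0.75
```

The bound is a guaranteed minimum for any data set; the actual proportion may be considerably larger.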
A sample data set has mean \(\bar{x}=10\) and standard deviation \(s = 3\). Find the minimum proportion of observations in the data set that must lie between 1 and 19.
A teacher decides to curve the final exam by adding 10 points to each student's score. Which of the following statistics will NOT change:
A. median, B. mean, C. interquartile range, D. standard deviation?
Please explain your conclusion.
Which distribution of data has the SMALLEST standard deviation? Please explain your conclusion.
Lab Instruction in Excel
To find the median, you may use the function MEDIAN().
To find quartiles, you may use the function QUARTILE.EXC.
Note: This Excel function uses a weighted mean for \(Q_1\) and \(Q_3\) instead of the arithmetic mean we used.
To find the mean, you may use the function AVERAGE().
To find the population standard deviation, you may use the function STDEV.P().
To find the sample standard deviation, you may use the function STDEV.S().
Select your data, either a single data series or multiple data series.
Click Insert > Insert Statistic Chart > Box and Whisker to create a boxplot.
For more information, see Create a box and whisker chart in Excel 365.
Consider the following sample that consists of speeds of 25 cars.
19, 4, 17, 22, 23, 8, 20, 19, 10, 10, 13, 13, 15, 12, 20, 14, 9, 20, 12, 11, 11, 17, 14, 7, 19