class: center, middle, inverse, title-slide

.title[
# Lesson 6: Characterizing Data
]
.author[
### Fei Ye
]
.date[
### May, 2024
]

---
class: center middle
## Unit 6A: Characterizing Data

*Primary Source:* PPT for the book "Using & Understanding Mathematics".

---

## Definition

The **distribution of a variable** refers to the way its values are spread over all possible values.

We can display a distribution with a table or graph. For example, the following shows the distribution of the ages of Oscar-winning female actors.

.pull-left[
<table class="table" style="font-size: 18px; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:center;"> Age </th>
   <th style="text-align:center;"> Number of Actresses </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:center;"> 20-29 </td>
   <td style="text-align:center;"> 32 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 30-39 </td>
   <td style="text-align:center;"> 34 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 40-49 </td>
   <td style="text-align:center;"> 14 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 50-59 </td>
   <td style="text-align:center;"> 2 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 60-69 </td>
   <td style="text-align:center;"> 6 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 70-79 </td>
   <td style="text-align:center;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> 80-89 </td>
   <td style="text-align:center;"> 1 </td>
  </tr>
  <tr>
   <td style="text-align:center;"> Total </td>
   <td style="text-align:center;"> 90 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
.center[
![:resize , 80%](data:image/png;base64,#img/image-20200818141120013.png)
]
]

---

## Measures of Center in a Distribution

- The **mean** is what we most commonly call the average value. It is defined as follows:

  $$ \text{Mean}=\frac{\text{Sum of All Values}}{\text{Total Number of Values}} $$

- The **median** is the middle value in the .big[sorted data] set (or halfway between the two middle values if the number of values is even).

- The **mode** is the most common value (or group of values) in a distribution.
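---

## Sketch: Mean, Median, and Mode in Python

The three definitions above can be tried out in a few lines of Python. This is a minimal sketch and not part of the source material: it assumes Python 3.8 or later (for `statistics.multimode`), and the data values are made up for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical values chosen for illustration; they are not from the slides.
values = [2, 3, 3, 5, 7, 10]

# Mean: sum of all values divided by the total number of values.
data_mean = mean(values)

# Median: middle value of the sorted data. With an even count,
# it is halfway between the two middle values (here 3 and 5).
data_median = median(values)

# Mode: the most common value. multimode returns a list so that
# ties (a group of equally common values) can be reported together.
data_modes = multimode(values)

print("mean:", data_mean)
print("median:", data_median)
print("mode(s):", data_modes)
```

Here the sum is 30 over 6 values, so the mean is 5; the median is halfway between 3 and 5, which is 4; and the only value that appears more than once is 3.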
---

## Example: Price Data (1 of 2)

Eight grocery stores sell the PR energy bar for the following prices: $1.09, $1.29, $1.29, $1.35, $1.39, $1.49, $1.59, $1.79. Find the mean, median, and mode for these prices.

**Solution:** By the definition of the mean,

$$
`\begin{aligned}
\text{Mean} =&\frac{\$1.09+\$1.29+\$1.29+\$1.35+\$1.39+\$1.49+\$1.59+\$1.79}{8}\\
=&\$1.41
\end{aligned}`
$$

The mean price is $1.41.

---

## Example: Price Data (2 of 2)

To find the median:

Step 1: Sort the data in ascending order:

.center[$1.09 $1.29 $1.29 $1.35 $1.39 $1.49 $1.59 $1.79.]

Step 2: Find the value that separates the sorted data into two equal halves.

Because there are eight prices (an even number), there are two values in the middle of the list: $1.35 and $1.39. The median lies halfway between these two values, which we calculate by adding them and dividing by 2:

$$\text{median}=\frac{\$1.35+\$1.39}{2}=\$1.37.~~$$

The mode is $1.29, which appears twice. Each of the other prices appears only once.

---

## Practice: Mean, Median and Mode

<iframe src="https://www.myopenmath.com/embedq2.php?id=750372&seed=2024&showansafter&allowregen" width="100%" height="480px" data-external="1"></iframe>

---

## Weighted Mean

Given a data set, we may associate to each data value a number, called the **weight**.

The **weighted sum** is the sum of the products of each data value and its weight.

The **total weight** is the sum of the weights.

The quotient of the weighted sum by the total weight is called the **weighted mean**:

$$ \text{Weighted Mean}=\frac{\text{Weighted Sum}}{\text{Total Weight}} $$

---

## Example: Course Overall Grade

In a course, the overall grade is determined in the following way: the homework average counts for 10%, the quiz average counts for 10%, the test average counts for 50%, and the final exam counts for 30%.

What is the overall grade of a student who earned 92 on homework, 95 on quizzes, 90 on tests, and 93 on the final?
--

**Solution:** The overall grade is the weighted mean

`$$\text{Overall}=\frac{0.1\cdot 92+0.1\cdot 95+0.5\cdot 90+0.3\cdot 93}{0.1+0.1+0.5+0.3}=91.6.$$`

---

## Practice: Overall Grade

<iframe src="https://www.myopenmath.com/embedq2.php?id=687404&seed=2024&showansafter&allowregen" width="100%" height="480px" data-external="1"></iframe>

---

## Effects of Outliers

An **outlier** in a data set is a data value that is much higher or much lower than almost all other values.

An outlier can change the mean of the data but generally does not affect the median or mode.

For example, consider the following data set of contract offers:

.center[$0 $0 $0 $0 $100,000]

The mean contract offer is $20,000. As this shows, the outlier $100,000 pulls the mean upward; an unusually low outlier would pull it downward. The median and mode, both $0, are not affected.

---

## Example: Wage Dispute (1 of 2)

A newspaper surveys wages for assembly workers in regional high-tech companies and reports an average of $22 per hour. The workers at one large firm immediately request a pay raise, claiming that they work as hard as employees at other companies but their average wage is only $19. The management rejects their request, telling them that they are overpaid because their average wage, in fact, is $23.

Can both sides be right? Explain.

---

## Example: Wage Dispute (2 of 2)

**Solution:** Both sides can be right if they are using different definitions of average. In this case, the workers may be using the median while management uses the mean.

For example, imagine that there are only five workers at the firm and their wages are $19, $19, $19, $19, and $39. The median of these five wages is $19 (as the workers claimed), but the mean is $23 (as management claimed).
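---

## Sketch: Checking the Wage Dispute

The two claims in the wage dispute can be checked numerically. The following is a minimal Python sketch using the five wages from the solution above; the variable names `workers_average` and `management_average` are illustrative choices, not part of the source.

```python
from statistics import mean, median

# Hourly wages (in dollars) of the five imagined workers in the example.
wages = [19, 19, 19, 19, 39]

# The workers' "average": the median, the middle value of the sorted wages.
workers_average = median(wages)

# Management's "average": the mean, (19 + 19 + 19 + 19 + 39) / 5.
management_average = mean(wages)

print("median wage:", workers_average)
print("mean wage:", management_average)
```

The single high wage of $39 raises the mean to $23 while leaving the median at $19, which is the outlier effect described earlier.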
---

## Shapes of Distributions

Two single-peaked (unimodal) distributions

![image-20200818143223330](data:image/png;base64,#img/image-20200818143223330.png)

A double-peaked (bimodal) distribution

![image-20200818143234507](data:image/png;base64,#img/image-20200818143234507.png)

---

## Symmetry

A distribution is **symmetric** if its left half is a mirror image of its right half.

A distribution that is not symmetric must have values that tend to be more spread out on one side than the other. In this case we say the distribution is **skewed**.

![image-20200818143250826](data:image/png;base64,#img/image-20200818143250826.png)

---

## Skewness

.pull-left[
A distribution is **left-skewed** if its values are more spread out on the left side.

![image-20200818143324434](data:image/png;base64,#img/image-20200818143324434.png)
]

.pull-right[
A distribution is **right-skewed** if its values are more spread out on the right side.

![image-20200818143332963](data:image/png;base64,#img/image-20200818143332963.png)
]

---

## Symmetry and Skewness - Definition

- A single-peaked distribution is symmetric if its left half is a mirror image of its right half.

- A single-peaked distribution is left-skewed if its values are more spread out on the left side of the mode.

- A single-peaked distribution is right-skewed if its values are more spread out on the right side of the mode.

---

## Example: Skewness (1 of 3)

For each of the following situations, state whether you expect the distribution to be symmetric, left-skewed, or right-skewed. Explain.

a. Heights of a sample of 100 women

b. Number of books read during the school year by fifth graders

c. Speeds of cars on a road where a visible patrol car is using radar to detect speeders.

---

## Example: Skewness (2 of 3)

**Solution:**

a. The distribution of heights of women is symmetric, because roughly equal numbers of women are shorter and taller than the mean and extremes of height are rare on either side of the mean.

b.
The distribution of the number of books read is right-skewed. Most fifth-grade children read a moderate number of books during the school year, but a few voracious readers will read far more than most other students. These students will therefore be outliers with high values for the number of books read, creating a tail on the right side of the distribution.

---

## Example: Skewness (3 of 3)

c. Drivers usually slow down when they are aware of a patrol car looking for speeders. Few if any drivers will be exceeding the speed limit, but some drivers slow to well below the speed limit. The distribution of speeds is therefore left-skewed, with a mode near the speed limit but a few cars going well below the speed limit.

---

## Practice: Identify Shapes

<iframe src="https://www.myopenmath.com/embedq2.php?id=734580&seed=2024&showansafter&allowregen" width="100%" height="480px" data-external="1"></iframe>

---

## Variation

From left to right, these three distributions have increasing variation.

**Variation** describes how widely data values are spread out about the center of a data set.

![image-20200818143352817](data:image/png;base64,#img/image-20200818143352817.png)

---

## Example: Variation in Marathon Times

How would you expect the symmetry and variation to differ between times in the Olympic marathon and times in the New York marathon? Explain.

**Solution:** The Olympic marathon invites only elite runners, whose times are likely to be clustered not far above world record times. The New York marathon allows runners of all abilities, whose times are spread over a very wide range. Therefore, the variation among the times should be greater in the New York marathon than in the Olympic marathon.

---

## Practice: The average of quizzes

Find the mean, the median and mode of the following quiz scores of 10 students.
.center[ 5, 9, 5, 8, 9, 8, 8, 7, 8, 7 ]

---

## Practice: Overall grade

A course has the following assessment components and weights: Homework, 30%; Lab, 10%; Test, 35%; and Final, 25%. A student earned the following grades: Homework, 100; Lab, 85; Test, 90; and Final, 93. Find the student's overall grade.

---

## Practice: Is the quiz hard or easy?

The following are quiz scores of 10 randomly selected students.

.center[0, 0, 0, 0, 7, 9, 9, 9, 10, 10 ]

Do you think the quiz is easy or hard?

---

## Practice: Shapes of distributions

For each of the following situations, state whether you expect the distribution to be unimodal or multimodal. If it is unimodal, determine whether it is symmetric, left-skewed, or right-skewed.

a. The scores of an easy quiz.

b. The heights of trees in a forest.

c. The scores of a hard test.

d. The number of people taking public transportation each hour from 6am to 10pm.

---

## Practice: Variation

How would you expect the symmetry and variation to differ between the department uniform final grades of one class and those of all classes? Please explain.