Summarize and interpret the relationship between two qualitative (categorical) variables using two-way tables.
Find and interpret conditional, joint, and marginal probabilities from a two-way frequency table.
Create and analyze two-way tables to answer probability questions.
As we organize and analyze data from two categorical variables, we make use of two-way tables.
Information in a two-way frequency table:
Values of the two variables are displayed in the left column and the top row.
The body of the table consists of frequency counts associated with pairs of values of the two variables.
The right column and the bottom row, which are called the margins of the table, consist of row totals and column totals respectively.
A number in a margin is called a marginal frequency.
A number in the body of the table is called a joint frequency.
The following table summarizes the responses of a random sample of 1,200 U.S. college students as part of a larger survey.
Gender | About Right | Overweight | Underweight | Row Totals |
---|---|---|---|---|
Female | 560 | 163 | 37 | 760 |
Male | 295 | 72 | 73 | 440 |
Column Totals | 855 | 235 | 110 | 1,200 |
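The margins can be recomputed from the body of the table. The following is a minimal Python sketch, not part of the original survey analysis; the dictionary layout and the variable names (`counts`, `row_totals`, `col_totals`) are illustrative assumptions.

```python
# Joint frequencies (body of the table), keyed by (gender, body image).
counts = {
    ("Female", "About Right"): 560,
    ("Female", "Overweight"): 163,
    ("Female", "Underweight"): 37,
    ("Male", "About Right"): 295,
    ("Male", "Overweight"): 72,
    ("Male", "Underweight"): 73,
}

# Marginal frequencies: add up the joint frequencies over the other variable.
row_totals = {}
col_totals = {}
for (gender, image), n in counts.items():
    row_totals[gender] = row_totals.get(gender, 0) + n
    col_totals[image] = col_totals.get(image, 0) + n

# The grand total is the sum of all joint frequencies.
grand_total = sum(counts.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

Each marginal frequency is a sum of joint frequencies, so the row totals and the column totals both add up to the same grand total.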
The following table shows joint and marginal probabilities of body image and gender.
Gender | About Right | Overweight | Underweight | Row Totals |
---|---|---|---|---|
Female | \(\frac{560}{1200}=46.67\%\) | \(\frac{163}{1200}=13.58\%\) | \(\frac{37}{1200}=3.08\%\) | \(\frac{760}{1200}=63.33\%\) |
Male | \(\frac{295}{1200}=24.58\%\) | \(\frac{72}{1200}=6.00\%\) | \(\frac{73}{1200}=6.08\%\) | \(\frac{440}{1200}=36.67\%\) |
Column Totals | \(\frac{855}{1200}=71.25\%\) | \(\frac{235}{1200}=19.58\%\) | \(\frac{110}{1200}=9.17\%\) | \(\frac{1200}{1200}=100.00\%\) |
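The same joint and marginal probabilities can be checked with a short Python sketch. It assumes the counts from the survey table above; the names `joint`, `p_female`, and `p_underweight` are illustrative.

```python
# Joint frequencies from the survey table, keyed by (gender, body image).
counts = {
    ("Female", "About Right"): 560,
    ("Female", "Overweight"): 163,
    ("Female", "Underweight"): 37,
    ("Male", "About Right"): 295,
    ("Male", "Overweight"): 72,
    ("Male", "Underweight"): 73,
}
total = sum(counts.values())

# Joint probability: joint frequency divided by the grand total.
joint = {cell: n / total for cell, n in counts.items()}

# Marginal probability: sum of the joint probabilities in a row or column.
p_female = sum(p for (gender, _), p in joint.items() if gender == "Female")
p_underweight = sum(p for (_, image), p in joint.items() if image == "Underweight")

for cell, p in joint.items():
    print(cell, f"{p:.2%}")
print("P(Female) =", f"{p_female:.2%}")
print("P(Underweight) =", f"{p_underweight:.2%}")
```

Dividing every entry by the grand total is what distinguishes joint and marginal probabilities from conditional probabilities, which divide by a row or column total instead.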
The following table shows the conditional probabilities of body image given gender, that is, the probability that a randomly selected female (or male) has a certain body image.
Gender | About Right | Overweight | Underweight | Row Totals |
---|---|---|---|---|
Female | \(\frac{560}{760}=73.68\%\) | \(\frac{163}{760}=21.45\%\) | \(\frac{37}{760}=4.87\%\) | \(\frac{760}{760}=100.00\%\) |
Male | \(\frac{295}{440}=67.05\%\) | \(\frac{72}{440}=16.36\%\) | \(\frac{73}{440}=16.59\%\) | \(\frac{440}{440}=100.00\%\) |
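A conditional distribution divides each joint frequency by its row total rather than by the grand total. The sketch below illustrates this under the assumption that the counts are stored in a dictionary; the helper name `conditional_distribution` is hypothetical.

```python
# Joint frequencies from the survey table, keyed by (gender, body image).
counts = {
    ("Female", "About Right"): 560,
    ("Female", "Overweight"): 163,
    ("Female", "Underweight"): 37,
    ("Male", "About Right"): 295,
    ("Male", "Overweight"): 72,
    ("Male", "Underweight"): 73,
}

def conditional_distribution(counts, gender):
    """Return P(body image | gender) for every body image category."""
    # Keep only the row for the given gender.
    row = {image: n for (g, image), n in counts.items() if g == gender}
    row_total = sum(row.values())
    # Divide each joint frequency by the row total, not the grand total.
    return {image: n / row_total for image, n in row.items()}

female = conditional_distribution(counts, "Female")
male = conditional_distribution(counts, "Male")

for image in female:
    print(image, f"female {female[image]:.2%}", f"male {male[image]:.2%}")
```

Because each row is divided by its own total, the probabilities in each row sum to 100%.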
The following table summarizes the full-time enrollment at a community college.
Gender | Arts-Sci | Bus-Econ | Info Tech | Health Science | Graphics Design | Culinary Arts | Row Totals |
---|---|---|---|---|---|---|---|
Female | 4,660 | 435 | 494 | 421 | 105 | 83 | 6,198 |
Male | 4,334 | 490 | 564 | 223 | 97 | 94 | 5,802 |
Column Totals | 8,994 | 925 | 1,058 | 644 | 202 | 177 | 12,000 |
Solution: \(P(\text{Male})=\dfrac{5802}{12000}= 0.4835=48.35\%.\)
Solution: \(P(\text{Info Tech}|\text{Male})=\dfrac{564}{5802}\approx 0.097=9.7\%.\)
Solution: \(P(\text{Male and Info Tech})=\dfrac{564}{12000}= 0.047=4.7\%.\)
Solution: \(P(\text{Male and Info Tech})=\dfrac{564}{12000}=\dfrac{5802}{12000}\cdot \dfrac{564}{5802}=P(\text{Male})\cdot P(\text{Info Tech}|\text{Male}).\)
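The multiplication rule in the last solution can be verified exactly with Python's `fractions` module. This is only a sketch using the enrollment counts above; the variable names are illustrative.

```python
from fractions import Fraction

# Counts taken from the enrollment table.
total = 12000          # all full-time students
male_total = 5802      # row total for Male
male_info_tech = 564   # joint frequency for Male and Info Tech

p_male = Fraction(male_total, total)
p_info_tech_given_male = Fraction(male_info_tech, male_total)
p_male_and_info_tech = Fraction(male_info_tech, total)

# Multiplication rule: P(Male and Info Tech) = P(Male) * P(Info Tech | Male).
rule_holds = p_male * p_info_tech_given_male == p_male_and_info_tech

print(rule_holds)
print(float(p_male), float(p_info_tech_given_male), float(p_male_and_info_tech))
```

Exact fractions are used here so that the equality check is not affected by floating-point rounding.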
The following table relates the weights and heights of a group of individuals participating in an observational study.
Weight/Height | Tall | Medium | Short |
---|---|---|---|
Obese | 18 | 28 | 14 |
Normal | 20 | 51 | 28 |
Underweight | 12 | 25 | 9 |
To understand association between categorical variables, it helps to ask the opposite question: what would the data look like if there were no association?
If the conditional probabilities are nearly equal for all categories, there may be no association between the variables. Conversely, if the conditional probabilities are different enough, we can be confident that there is an association.
In general, the bigger the differences in the conditional probabilities, the stronger the association between the variables.
Two variables \(X\) and \(Y\) are independent if \(P(X~\text{and}~Y)=P(X)\cdot P(Y)\). Equivalently, \(P(X|Y)=P(X)\) and \(P(Y|X)=P(Y)\).
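As an illustration, the product rule can be checked on one cell of the body image survey (760 females, 110 underweight students, and 37 underweight females among 1,200 students). This is a sketch with illustrative variable names. Sample proportions rarely satisfy the equality exactly, so what matters in practice is how far apart the two sides are.

```python
from fractions import Fraction

total = 1200
p_female = Fraction(760, total)                 # marginal probability
p_underweight = Fraction(110, total)            # marginal probability
p_female_and_underweight = Fraction(37, total)  # joint probability

# Under independence the joint probability would equal this product.
product = p_female * p_underweight

print(f"P(Female and Underweight) = {float(p_female_and_underweight):.4f}")
print(f"P(Female) * P(Underweight) = {float(product):.4f}")
print("Exactly equal:", p_female_and_underweight == product)
```

Here the joint probability is roughly half of the product of the marginals, which suggests that being female and feeling underweight are not independent in this sample.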
Is body image related to gender?
Gender | About Right | Overweight | Underweight | Row Totals |
---|---|---|---|---|
Female | 560 | 163 | 37 | 760 |
Male | 295 | 72 | 73 | 440 |
Column Totals | 855 | 235 | 110 | 1,200 |
Solution: Using a stacked bar chart in Excel, we can compare the conditional body image distributions for females and males side by side.
As a result of our analysis, we know that the conditional distributions of body image for males and females are quite different. We conclude that the difference is large enough to believe that the two categorical variables are in fact related.
When calculating the probability of a negative outcome, we often refer to the probability as a risk.
In general, we are interested in determining how much a new treatment reduces the risk compared to a reference risk.
The percentage reduction of risk is
$$\text{percentage reduction of risk}=\frac{\text{new treatment risk}-\text{reference risk}}{\text{reference risk}}.$$
Researchers in the Physicians’ Health Study (1989) designed a randomized double-blind experiment to determine whether aspirin reduces the risk of heart attack. Here are the final results.
Treatment | Heart Attack | No Heart Attack | Row Totals |
---|---|---|---|
Aspirin | 139 | 10,898 | 11,037 |
Placebo | 239 | 10,795 | 11,034 |
Column Totals | 378 | 21,693 | 22,071 |
Does aspirin lower the risk of having a heart attack?
Solution: We first compute two conditional probabilities: \(P(\text{heart attack}|\text{aspirin})=\dfrac{139}{11037}\approx 0.013\) and \(P(\text{heart attack}|\text{placebo})=\dfrac{239}{11034}\approx 0.022\).
The result shows that taking aspirin reduced the risk from 0.022 to 0.013.
The percentage reduction of risk is $$ \frac{P(\text{heart attack}|\text{aspirin})-P(\text{heart attack}|\text{placebo})}{P(\text{heart attack}|\text{placebo})}=\frac{0.013-0.022}{0.022}=\frac{-0.009}{0.022}\approx -0.41. $$
Therefore, we conclude that taking aspirin results in a 41% reduction in risk.
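The calculation can be reproduced with a small Python sketch. The function name `percentage_reduction_of_risk` is hypothetical and simply encodes the formula given earlier. The sketch computes the result twice: once with the risks rounded to three decimals, as in the text, and once with unrounded risks, which gives a slightly different value.

```python
def percentage_reduction_of_risk(new_risk, reference_risk):
    """(new treatment risk - reference risk) / reference risk."""
    return (new_risk - reference_risk) / reference_risk

# Conditional probabilities (risks) from the Physicians' Health Study table.
risk_aspirin = 139 / 11037   # P(heart attack | aspirin)
risk_placebo = 239 / 11034   # P(heart attack | placebo)

# Rounding the risks first, as done in the worked solution.
change_rounded = percentage_reduction_of_risk(
    round(risk_aspirin, 3), round(risk_placebo, 3)
)

# Using the unrounded risks.
change_unrounded = percentage_reduction_of_risk(risk_aspirin, risk_placebo)

print(f"rounded risks:   {change_rounded:.2f}")
print(f"unrounded risks: {change_unrounded:.2f}")
```

The negative sign indicates a reduction. Rounding the risks before dividing gives about a 41% reduction, while keeping full precision gives about 42%, so intermediate rounding affects the second decimal place.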
A hypothetical two-way table, also known as a hypothetical 1000 two-way table, is a two-way table constructed from given probability conditions with 1000 as the total frequency. It can be used to answer complex probability questions.
Sometimes, the total frequency can be taken to be 10,000 or a higher power of 10 so that joint frequencies are integers.
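A pregnant woman often opts to have an ultrasound to predict the gender of her baby. Assume the following facts are known:
Fact 1: 48% of babies are girls.
Fact 2: When the baby is a girl, the ultrasound predicts a girl 9 times out of 10.
Fact 3: When the baby is a boy, the ultrasound predicts a boy 3 times out of 4.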
Use the above facts to answer the following questions.
If the examination predicts a girl, how likely is it that the baby is a girl?
If the examination predicts a boy, how likely is it that the baby is a boy?
Solution:
Assume that we have ultrasound predictions for 1,000 random babies.
Fact 1 means that \(P(\text{Girl})=48\%\).
Fact 2 means that \(P(\text{Predicted as girl}|\text{Girl})=9 /10\).
Fact 3 means that \(P(\text{Predicted as boy}|\text{Boy})=3 /4\).
Using those facts, we may create a two-way frequency table.
Prediction | Girl | Boy | Row Totals |
---|---|---|---|
Predict Girl | \(480\cdot\frac{9}{10}= 432\) | \(520-390 = 130\) | \(432+130=562\) |
Predict Boy | \(480-432=48\) | \(520\cdot\frac34=390\) | \(48+390=438\) |
Column Totals | \(48\%\cdot 1000=480\) | \(1000-480=520\) | \(1,000\) |
If the examination predicts a girl, the probability that the baby is a girl is $$P(\text{Girl}|\text{predict girl})=\frac{432}{562} \approx 0.769=76.9\%.$$
If the examination predicts a boy, the probability that the baby is a boy is $$P(\text{Boy} | \text{predict boy}) = \frac{390}{438} \approx 0.890=89.0\%.$$
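The construction of the hypothetical 1000 table can also be sketched in Python. The code below follows the same steps as the solution, using exact fractions so that the joint frequencies come out as whole numbers; all variable names are illustrative.

```python
from fractions import Fraction

total = 1000  # hypothetical number of babies

# Column totals from Fact 1: P(Girl) = 48%.
girls = total * Fraction(48, 100)
boys = total - girls

# Girl column from Fact 2: P(Predicted as girl | Girl) = 9/10.
girl_predicted_girl = girls * Fraction(9, 10)
girl_predicted_boy = girls - girl_predicted_girl

# Boy column from Fact 3: P(Predicted as boy | Boy) = 3/4.
boy_predicted_boy = boys * Fraction(3, 4)
boy_predicted_girl = boys - boy_predicted_boy

# Row totals: all babies predicted to be girls, and all predicted to be boys.
predict_girl_total = girl_predicted_girl + boy_predicted_girl
predict_boy_total = girl_predicted_boy + boy_predicted_boy

# Conditional probabilities given the prediction.
p_girl_given_predict_girl = girl_predicted_girl / predict_girl_total
p_boy_given_predict_boy = boy_predicted_boy / predict_boy_total

print(f"P(Girl | predict girl) = {float(p_girl_given_predict_girl):.3f}")
print(f"P(Boy | predict boy)   = {float(p_boy_given_predict_boy):.3f}")
```

The given facts condition on the actual gender, while the questions condition on the prediction. Filling in the table first makes it possible to read off the row totals needed for the reversed conditional probabilities.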
The table below is based on a 1988 study of accident records conducted by the Florida State Department of Highway Safety.
Seat Belt Use | Nonfatal | Fatal | Row Totals |
---|---|---|---|
Seat Belt | 412,368 | 510 | 412,878 |
No Seat Belt | 162,527 | 1,601 | 164,128 |
Column Totals | 574,895 | 2,111 | 577,006 |
Does wearing a seat belt lower the risk of an accident resulting in a fatality?
A large company has instituted a mandatory employee drug screening program. Assume that the drug test used is known to be 99% accurate. That is, if an employee is a drug user, the test will come back positive (“drug detected”) 99% of the time. If an employee is a non-drug user, then the test will come back negative (“no drug detected”) 99% of the time. Assume that 2% of the employees of the company are drug users.
If an employee’s drug test comes back positive, what is the probability that the test is wrong and the employee is in fact a non-drug user?
Lab Instruction in Excel
To create a stacked bar chart of a two-way table:
First select the data table.
In the menu Insert -> Charts, look for and click Insert Column or Bar Chart.
In the dropdown menu, choose the third option in 2-D Column (100% Stacked Column) or the third option in 2-D Bar (100% Stacked Bar).
To switch row/column, right-click the row axis or the column axis in the output graph and choose the option Select Data... to make the switch.
The following table summarizes results from a study on program selection and gender.
Gender | Arts-Sci | Bus-Econ | Info Tech | Health Science | Graphics Design | Culinary Arts | Row Totals |
---|---|---|---|---|---|---|---|
Female | 4,660 | 435 | 494 | 421 | 105 | 83 | 6,198 |
Male | 4,334 | 490 | 564 | 223 | 97 | 94 | 5,802 |
Column Totals | 8,994 | 925 | 1,058 | 644 | 202 | 177 | 12,000 |
Use Excel to answer the following question about the study.
Is there an association between gender and program selection? Why or why not?
If they are associated, is the association strong or weak?