Topic 4: Linear Relationship

Fei Ye

November 2024

1 Learning Goals


2 Scatterplots


3 Direction of Linear Relationship

  • Positive relationship: the response variable (y) increases when the explanatory variable (x) increases.
  • Negative relationship: the response variable (y) decreases when the explanatory variable (x) increases.

4 Forms of Scatterplots


5 Strength of Relationship

The strength of the relationship is a description of how closely the data follow the form of the relationship.


6 Outliers

Outliers are points that deviate from the pattern of the relationship.


Practice: Match Scatterplots

A: X = month (January = 1), Y = rainfall (inches) in Napa, CA in 2010 (Note: Napa has rain in the winter months and months with little to no rainfall in summer.)

B: X = month (January = 1), Y = average temperature in Boston MA in 2010 (Note: Boston has cold winters and hot summers.)

C: X = year (in five-year increments from 1970), Y = Medicare costs (in $) (Note: the yearly increase in Medicare costs has gotten bigger and bigger over time.)

D: X = average temperature in Boston MA (°F), Y = average temperature in Boston MA (°C) each month in 2010

E: X = chest girth (cm), Y = shoulder girth (cm) for a sample of men

F: X = engine displacement (liters), Y = city miles per gallon for a sample of cars (Note: engine displacement is roughly a measure of engine size. Large engines use more gas.)

Source: Scatterplots (2 of 5) in Concepts in Statistics


7 The Correlation Coefficient

The correlation coefficient \(r\) is a numerical measure of the strength and direction of a linear relationship between two quantitative variables. One definition is the mean of the products of the standardized values (z-scores). $$r=\dfrac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{\sqrt{\sum\left(\frac{x-\bar{x}}{s_x}\right)^2}\cdot\sqrt{\sum\left(\frac{y-\bar{y}}{s_y}\right)^2}}=\dfrac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1},$$ where \(n\) is the sample size, \(x\) is a data value for the explanatory variable, \(\bar{x}\) is the mean of the \(x\)-values, \(s_x\) is the standard deviation of the \(x\)-values, and similarly for the notations involving \(y\). The second equality holds because \(\sum\left(\frac{x-\bar{x}}{s_x}\right)^2=\sum\left(\frac{y-\bar{y}}{s_y}\right)^2=n-1\).

See the paper Thirteen Ways to Look at the Correlation Coefficient for other definitions.
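
As a quick numerical sanity check (not part of the original notes), the following Python sketch, assuming NumPy is available, verifies on a small made-up data set that the two expressions above agree and match NumPy's built-in correlation function.

```python
import numpy as np

# Small illustrative data set (hypothetical values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
# Standardized values (z-scores) using the sample standard deviation (ddof=1).
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# First form: sum of products divided by the product of the root sums of squares.
r_quotient = np.sum(zx * zy) / (np.sqrt(np.sum(zx**2)) * np.sqrt(np.sum(zy**2)))

# Second form: sum of products of z-scores divided by n - 1.
r_mean_product = np.sum(zx * zy) / (n - 1)

print(r_quotient, r_mean_product, np.corrcoef(x, y)[0, 1])  # all three agree
```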


8 A Few Remarks on Correlation Coefficient


9 Geometric Intuition


10 Geometric Conclusion

For the scatterplots in the previous slides, we see that

The idea of using products comes from the geometric interpretation of the dot product, \(\mathbf{u}\cdot\mathbf{v}=\lVert \mathbf{u}\rVert\lVert\mathbf{v}\rVert\cos\theta\). There are also other interpretations of \(r\).
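
To make this intuition concrete, here is a short Python sketch (an added illustration using made-up data, assuming NumPy) showing that \(r\) is exactly the cosine of the angle between the mean-centered data vectors.

```python
import numpy as np

# Hypothetical paired data.
x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([8.0, 12.0, 13.0, 20.0, 22.0])

# Mean-center both variables to form the vectors u and v.
u = x - x.mean()
v = y - y.mean()

# cos(theta) = (u . v) / (||u|| ||v||), which equals the correlation coefficient r.
cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos_theta, np.corrcoef(x, y)[0, 1])  # the two values match
```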


11 Properties

After discussing regression lines, you will see why \(r^2\le 1\) and why \(r\) measures the strength of a linear relationship; you may also read 18.4 - More on Understanding Rho.


12 Limitations and Sensitivity to Outliers


Practice: Guess the Correlation Coefficient

Source: https://istats.shinyapps.io/guesscorr/


13 Example: The Correlation Coefficient (1 of 2)

Describe the relationship between Midterm 1 and Final scores for a sample of 11 students with the data shown in the table below.

Solution: First we create a scatterplot.

Using the Excel function CORREL(x, y), we find the correlation coefficient is \(r=0.905\).

The \(r\)-value shows a strong positive linear relationship.

Midterm1 Final
72 72
93 88
81 82
82 82
94 88
80 77
73 78
71 77
81 76
81 76
63 68
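
For readers who want to verify the Excel result outside a spreadsheet, a minimal Python check (assuming NumPy) on the same data:

```python
import numpy as np

midterm1 = [72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63]
final = [72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68]

# Counterpart of Excel's CORREL(x, y): the (0, 1) entry of the correlation matrix.
r = np.corrcoef(midterm1, final)[0, 1]
print(round(r, 3))  # approximately 0.905
```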

14 Example: The Correlation Coefficient (2 of 2)

The correlation coefficient \(r\) can also be calculated by hand using the formula \(r=\dfrac{\sum z_xz_y}{n-1}\), where \(z_x=\frac{x-\bar{x}}{s_x}\) and \(z_y=\frac{y-\bar{y}}{s_y}\).

Midterm 1   Final   z_x        z_y        z_x * z_y
72          72      -0.78006   -1.06926    0.834088
93          88       1.50088    1.544483   2.318084
81          82       0.197484   0.56433    0.111446
82          82       0.306101   0.56433    0.172742
94          88       1.609497   1.544483   2.485840
80          77       0.088868  -0.25246   -0.022436
73          78      -0.67145   -0.0891     0.059829
71          77      -0.88868   -0.25246    0.224359
81          76       0.197484  -0.41582   -0.082118
81          76       0.197484  -0.41582   -0.082118
63          68      -1.75761   -1.72269    3.027821

Mean:      79.18182   78.54545             Sum of products:  9.047536
StDev.S:    9.206717   6.121497            r = 9.047536/(11 - 1) = 0.904754
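
The spreadsheet computation above can be reproduced step by step; the sketch below (a Python illustration assuming NumPy, not part of the original notes) mirrors the z-score columns, the column summaries, and the final division by \(n-1\).

```python
import numpy as np

midterm1 = np.array([72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63], dtype=float)
final = np.array([72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68], dtype=float)
n = len(midterm1)

# Sample means and sample standard deviations (Excel AVERAGE and STDEV.S).
x_bar, y_bar = midterm1.mean(), final.mean()        # about 79.18 and 78.55
s_x, s_y = midterm1.std(ddof=1), final.std(ddof=1)  # about 9.21 and 6.12

# z-score columns and their products, as in the table above.
z_x = (midterm1 - x_bar) / s_x
z_y = (final - y_bar) / s_y
products = z_x * z_y

# The sum of the products divided by n - 1 gives r.
r = products.sum() / (n - 1)
print(round(r, 4))  # approximately 0.9048
```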

Practice: Calculate Correlation Coefficient


15 Correlation vs Causation


16 Example: Correlation vs Causation (1 of 2)

The scatterplot below shows the relationship between the number of firefighters sent to fires (x) and the amount of damage caused by fires (y) in a certain city.

Can we conclude that the increase in firefighters causes the increase in damage?

Source: See Causation and Lurking Variables in Concepts in Statistics for more examples.


17 Example: Correlation vs Causation (2 of 2)

Solution:

  1. Correlation: The more firefighters sent to a fire, the greater the damage tends to be, so the two variables are positively correlated.

  2. Prediction: You could predict the amount of damage by looking at the number of firefighters present.

  3. Causation: However, the firefighters are unlikely to be the cause of the damage.

  4. Lurking variable: The seriousness of the fire is a lurking variable: a more serious fire leads to both more firefighters being sent and greater damage.


Practice: Lurking Variable


18 The Regression Line


19 Example: Old Faithful Geyser (1 of 2)

The following sample is taken from data about the Old Faithful geyser.

  1. Study the linear relationship.
  2. Find an equation of the regression line.
  3. Find the predicted value and the residual when the eruption time is 1.8 minutes.

eruptions (min)   waiting (min)
3.917             84
4.200             78
1.750             47
4.700             83
2.167             52
1.750             62
4.800             84
1.600             52
4.250             79
1.800             51

20 Example: Old Faithful Geyser (2 of 2)

Solution: The scatterplot shows a linear relationship.

The slope of the regression line can be obtained using the Excel function SLOPE(). In this example, \(m= 10.836\).

The \(y\)-intercept \((0,b)\) can be obtained using the Excel function INTERCEPT(). In this example, \(b= 33.68\).

An equation of the regression line is then \(\hat{y}=10.836x + 33.68\).

When \(x=1.8\), we have \(\hat{y}=10.836\cdot 1.8 + 33.68= 53.1848\).

The residual is \(y-\hat{y}=51-53.1848= -2.1848\). That means the prediction overestimates the waiting time by about 2.18 minutes.
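
The slope, intercept, predicted value, and residual can also be reproduced outside Excel; the following Python sketch (an added illustration, assuming NumPy) applies the least-squares formulas directly to the sample.

```python
import numpy as np

# Old Faithful sample: eruption time (minutes) and waiting time (minutes).
eruptions = np.array([3.917, 4.200, 1.750, 4.700, 2.167, 1.750, 4.800, 1.600, 4.250, 1.800])
waiting = np.array([84, 78, 47, 83, 52, 62, 84, 52, 79, 51], dtype=float)

# Least-squares slope and intercept (counterparts of Excel's SLOPE and INTERCEPT).
x_dev = eruptions - eruptions.mean()
y_dev = waiting - waiting.mean()
m = np.sum(x_dev * y_dev) / np.sum(x_dev**2)  # about 10.836
b = waiting.mean() - m * eruptions.mean()     # about 33.68

# Predicted value and residual at an eruption time of 1.8 minutes.
y_hat = m * 1.8 + b   # about 53.18
residual = 51 - y_hat  # about -2.18
print(round(m, 3), round(b, 2), round(y_hat, 2), round(residual, 2))
```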


Practice: Find Regression Line


21 Residual Plot

Image source: Figure 5.14 in Introduction to Statistics and Data Analysis


22 Standard Error


23 Coefficient of Determination

The coefficient of determination \(r^2\) measures the proportion of the variability in the response variable \(y\) that can be attributed to the linear regression line (a nice explanation can be found in Explaining The Variance of a Regression Model).

A visualization by Magnusson, K. can be found at https://rpsychologist.com/correlation/.
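
One way to see the "proportion of variability" idea numerically is the identity \(r^2 = 1 - \dfrac{SS_{res}}{SS_{tot}}\). The Python sketch below (an added illustration, assuming NumPy) checks this identity on the Old Faithful sample from the earlier example.

```python
import numpy as np

# Old Faithful sample from the earlier example.
x = np.array([3.917, 4.200, 1.750, 4.700, 2.167, 1.750, 4.800, 1.600, 4.250, 1.800])
y = np.array([84, 78, 47, 83, 52, 62, 84, 52, 79, 51], dtype=float)

# Fit the least-squares line and compute the fitted values.
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b = y.mean() - m * x.mean()
y_hat = m * x + b

# Proportion of variability in y explained by the line: 1 - SS_res / SS_tot.
ss_res = np.sum((y - y_hat)**2)
ss_tot = np.sum((y - y.mean())**2)
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3), round(np.corrcoef(x, y)[0, 1]**2, 3))  # both values agree
```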


24 Remarks on Coefficient of Determination


25 Example: \(s_e\) and \(r^2\)

Find the standard error and the coefficient of determination for the Midterm 1 and Final data from the earlier example.

Solution:

In Excel, the function STEYX() can be used to obtain the residual standard error. In this example, \(s_e\approx 2.748\).

The correlation coefficient is \(0.905\).

The coefficient of determination is \(r^2=0.905^2\approx 0.819\).
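
As a cross-check of the Excel values (an added illustration, assuming NumPy), the residual standard error can be computed directly as \(s_e=\sqrt{SS_{res}/(n-2)}\):

```python
import numpy as np

midterm1 = np.array([72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63], dtype=float)
final = np.array([72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68], dtype=float)
n = len(midterm1)

# Least-squares line for predicting Final from Midterm 1.
sxx = np.sum((midterm1 - midterm1.mean())**2)
sxy = np.sum((midterm1 - midterm1.mean()) * (final - final.mean()))
m = sxy / sxx
b = final.mean() - m * midterm1.mean()
residuals = final - (m * midterm1 + b)

# Residual standard error (counterpart of Excel's STEYX) and coefficient of determination.
s_e = np.sqrt(np.sum(residuals**2) / (n - 2))      # about 2.748
r_squared = np.corrcoef(midterm1, final)[0, 1]**2  # about 0.819
print(round(s_e, 3), round(r_squared, 3))
```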


Practice: Analyzing Linear Relationship


References


Lab Instruction in Excel


26 Scatter Plots and Correlation Coefficient


27 Slope, Intercept, \(r^2\) and \(s_e\)