Summarize and interpret the relationship between two quantitative variables.
Demonstrate understanding of concepts pertaining to linear regression.
Use regression equations to make predictions and understand their limits.
Correlation refers to a relationship between two quantitative variables:
the independent (or explanatory) variable, usually denoted by \(x\).
the dependent (or response) variable, usually denoted by \(y\).
Example: In a study of educational attainment and annual salary, the years of education is the explanatory variable and the annual salary is the response variable.
To describe the relationship between two quantitative variables, statisticians use a scatterplot.
In a scatterplot, we describe the overall pattern with descriptions of direction, form, and strength.
The strength of the relationship is a description of how closely the data follow the form of the relationship.
Outliers are points that deviate from the pattern of the relationship.
A: X = month (January = 1), Y = rainfall (inches) in Napa, CA in 2010 (Note: Napa has rain in the winter months and months with little to no rainfall in summer.)
B: X = month (January = 1), Y = average temperature in Boston, MA in 2010 (Note: Boston has cold winters and hot summers.)
C: X = year (in five-year increments from 1970), Y = Medicare costs (in $) (Note: the yearly increase in Medicare costs has gotten bigger and bigger over time.)
D: X = average temperature in Boston, MA (°F), Y = average temperature in Boston, MA (°C) each month in 2010
E: X = chest girth (cm), Y = shoulder girth (cm) for a sample of men
F: X = engine displacement (liters), Y = city miles per gallon for a sample of cars (Note: engine displacement is roughly a measure of engine size. Large engines use more gas.)
The correlation coefficient \(r\) is a numeric measure of the strength and direction of a linear relationship between two quantitative variables. One definition is the mean product of standardized values: $$r=\dfrac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2}\cdot\sqrt{\sum(y-\bar{y})^2}}=\dfrac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1},$$ where \(n\) is the sample size, \(x\) is a data value for the explanatory variable, \(\bar{x}\) is the mean of the \(x\)-values, \(s_x\) is the standard deviation of the \(x\)-values, and similarly for the notations involving \(y\).
See the paper Thirteen Ways to Look at the Correlation Coefficient for other definitions.
The expression \(z=\dfrac{x-\bar{x}}{s_x}\) is known as the standardized variable (or \(z\)-score), which
doesn’t depend on the unit of the variable \(x\),
has mean \(0\) and standard deviation \(1\).
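To make the definition concrete, here is a minimal Python sketch (an illustration alongside the Excel workflow used in this course; numpy is assumed to be installed) that computes \(r\) as the mean product of \(z\)-scores and checks it against numpy's built-in correlation:

```python
import numpy as np

def correlation(x, y):
    """r as the mean product of standardized values, with an n - 1 denominator."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    z_x = (x - x.mean()) / x.std(ddof=1)  # ddof=1 gives the sample standard deviation s_x
    z_y = (y - y.mean()) / y.std(ddof=1)
    return np.sum(z_x * z_y) / (len(x) - 1)

# Midterm 1 and Final scores from the worked example later in this section
x = [72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63]
y = [72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68]
print(correlation(x, y))        # about 0.905
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in value agrees
```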
In Excel, the correlation coefficient can be calculated using the function CORREL().
Rounding Rule: Round to the nearest thousandth for \(r\), \(m\) and \(b\).
For the scatterplots in the previous slides, we see that
\(r>0\) if all points \((x-\bar{x}, y-\bar{y})\) are in the 1st and the 3rd quadrants.
\(r<0\) if all points \((x-\bar{x}, y-\bar{y})\) are in the 2nd and the 4th quadrants.
\(|r|\) is larger when the points lie closer to a line.
One motivation for using products of deviations comes from the geometric interpretation of the dot product, \(\mathbf{u}\cdot\mathbf{v}=\lVert \mathbf{u}\rVert\lVert\mathbf{v}\rVert\cos\theta\). There are also other interpretations of \(r\).
The correlation coefficient \(r\) is between \(-1\) and \(1\).
The closer the absolute value \(|r|\) is to \(1\), the stronger the linear relationship is. Conventionally, the relationship is strong if \(|r| > 0.8\), moderate if \(0.5< |r|\le 0.8\), and weak if \(|r|\le 0.5\).
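These conventional cutoffs are easy to encode; here is a tiny Python helper (an illustration, not a standard library function):

```python
def describe_strength(r):
    """Classify |r| using the conventional cutoffs given above."""
    a = abs(r)
    if a > 0.8:
        return "strong"
    elif a > 0.5:
        return "moderate"
    else:
        return "weak"

print(describe_strength(0.905))  # strong
```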
The correlation is symmetric in \(x\) and \(y\); that is, CORREL(x, y) = CORREL(y, x).
The correlation does not change when the units of measurement of either variable change. In other words, rescaling the explanatory and/or response variable has no effect on the correlation \(r\).
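A quick way to convince yourself of this unit invariance is to rescale one variable and recompute \(r\). The sketch below (made-up temperature data, numpy assumed) converts °F to °C and gets the same correlation:

```python
import numpy as np

x_f = np.array([30.0, 42.0, 55.0, 61.0, 70.0])  # made-up temperatures in °F
y = np.array([2.1, 3.0, 3.9, 4.4, 5.2])         # made-up response values

x_c = (x_f - 32) * 5 / 9                        # the same temperatures in °C

print(np.corrcoef(x_f, y)[0, 1])  # r using °F
print(np.corrcoef(x_c, y)[0, 1])  # identical r using °C (up to floating-point rounding)
```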
After discussing regression lines, you will see why \(r^2\le 1\) and why \(r\) measures the strength of a linear relationship; you may also read 18.4 - More on Understanding Rho.
The correlation by itself is not enough to determine whether a relationship is linear. It’s important to graph a data set before analyzing it. See Francis Anscombe’s demonstration of both the importance of graphing data and the effect of outliers on statistical properties.
The correlation is heavily influenced by outliers. Try the simulation in Linear Relation (4 of 4) in Concepts in Statistics.
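In the same spirit as that simulation, the following sketch (simulated data, numpy assumed) shows how a single extreme point can sharply change \(r\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(0, 1, 30)  # a strong linear relationship

print(np.corrcoef(x, y)[0, 1])    # close to 1

# append one extreme outlier far from the pattern
x_out = np.append(x, 30.0)
y_out = np.append(y, -20.0)
print(np.corrcoef(x_out, y_out)[0, 1])  # much weaker; one point can even flip the sign
```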
Describe the relationship between Midterm 1 and Final for a sample of 11 students, with the data shown below.
Solution: First we create a scatterplot.
Using the Excel function CORREL(x, y), we find the correlation coefficient is \(r=0.905\).
The \(r\)-value shows a strong positive linear relationship.
Midterm1 | Final |
---|---|
72 | 72 |
93 | 88 |
81 | 82 |
82 | 82 |
94 | 88 |
80 | 77 |
73 | 78 |
71 | 77 |
81 | 76 |
81 | 76 |
63 | 68 |
The correlation coefficient \(r\) can also be calculated by hand using the formula \(r=\dfrac{\sum z_xz_y}{n-1}\), where \(z_x=\frac{x-\bar{x}}{s_x}\) and \(z_y=\frac{y-\bar{y}}{s_y}\).
Midterm1 | Final | z_x | z_y | z_x z_y |
---|---|---|---|---|
72 | 72 | -0.78006 | -1.06926 | 0.834087814 |
93 | 88 | 1.50088 | 1.544483 | 2.318083715 |
81 | 82 | 0.197484 | 0.56433 | 0.111446332 |
82 | 82 | 0.306101 | 0.56433 | 0.172741815 |
94 | 88 | 1.609497 | 1.544483 | 2.485839773 |
80 | 77 | 0.088868 | -0.25246 | -0.02243591 |
73 | 78 | -0.67145 | -0.0891 | 0.059829084 |
71 | 77 | -0.88868 | -0.25246 | 0.224359064 |
81 | 76 | 0.197484 | -0.41582 | -0.08211835 |
81 | 76 | 0.197484 | -0.41582 | -0.08211835 |
63 | 68 | -1.75761 | -1.72269 | 3.027820885 |
79.18182 | 78.54545 | <- mean | sum -> | 9.047535876 |
9.206717 | 6.121497 | <- stdev.s | correl -> | 0.904753588 |
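As a sanity check, the hand computation in the table can be reproduced in a few lines of Python (numpy assumed; Excel's STDEV.S corresponds to ddof=1):

```python
import numpy as np

midterm = np.array([72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63], dtype=float)
final = np.array([72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68], dtype=float)

z_x = (midterm - midterm.mean()) / midterm.std(ddof=1)
z_y = (final - final.mean()) / final.std(ddof=1)

print(midterm.mean(), final.mean())            # 79.18..., 78.54...
print(midterm.std(ddof=1), final.std(ddof=1))  # 9.206..., 6.121...
print(np.sum(z_x * z_y))                       # about 9.0475
print(np.sum(z_x * z_y) / (len(midterm) - 1))  # r, about 0.9048
```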
Correlation describes data from an observational study. Observational studies cannot prove cause and effect, which requires a controlled study and rigorous inference.
Correlation may be used to make a prediction, which is probabilistic.
An \(r\)-value close to \(1\) or \(-1\) is insufficient to claim that the explanatory variable causes changes in the response variable. The correct interpretation is that there is a statistical relationship between the variables.
A lurking variable is a variable that is not measured in the study, but affects the interpretation of the relationship between the explanatory and response variables.
The scatterplot below shows the relationship between the number of firefighters sent to fires (x) and the amount of damage caused by fires (y) in a certain city.
Can we conclude that the increase in firefighters causes the increase in damage?
Solution:
Correlation: The more firefighters sent, the greater the damage tends to be. However, the firefighters do not cause the fire.
Prediction: You could predict the amount of damage from the number of firefighters present.
Causation: The firefighters are unlikely to be the cause of the fire.
Lurking variable: The seriousness of the fire is a lurking variable.
The following sample is taken from data about the Old Faithful geyser. Find an equation of the regression line and use it to predict the waiting time when an eruption lasts \(1.8\) minutes.
eruptions (min) | waiting (min) | eruptions (min) | waiting (min) |
---|---|---|---|
3.917 | 84 | 1.75 | 62 |
4.200 | 78 | 4.80 | 84 |
1.750 | 47 | 1.60 | 52 |
4.700 | 83 | 4.25 | 79 |
2.167 | 52 | 1.80 | 51 |
Solution: The scatterplot shows a linear relationship.
The slope of the regression line can be obtained using the Excel function SLOPE(). In this example, \(m=10.836\).
The \(y\)-intercept \((0,b)\) can be obtained using the Excel function INTERCEPT(). In this example, \(b=33.68\).
An equation of the line is then \(\hat{y}=10.836x + 33.68\).
When \(x=1.8\), we have \(\hat{y}=10.836\times 1.8 + 33.68 = 53.1848\).
The residual is \(y-\hat{y}=51-53.1848=-2.1848\). That means the prediction overestimates the waiting time by about 2.18 minutes.
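The same slope, intercept, prediction, and residual can be checked outside Excel; here is a short Python sketch (numpy assumed) using the sample from the table above:

```python
import numpy as np

# Old Faithful sample from the table above
eruptions = np.array([3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80])
waiting = np.array([84, 78, 47, 83, 52, 62, 84, 52, 79, 51], dtype=float)

# least-squares fit; np.polyfit plays the role of Excel's SLOPE() and INTERCEPT()
m, b = np.polyfit(eruptions, waiting, 1)
print(m, b)          # about 10.836 and 33.68

y_hat = m * 1.8 + b  # predicted waiting time for a 1.8-minute eruption
print(y_hat)         # about 53.18
print(51 - y_hat)    # residual y - y_hat, about -2.18
```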
A residual is an observable estimate of the unobservable statistical error.
Positive and Negative Residuals
A residual plot, a scatterplot of the points \((x, \text{residual})\), can be used to assess whether a linear model is appropriate. A random pattern (or no obvious pattern) indicates a good fit of a linear model. See Assessing the Fit of a Line (2 of 4) in Concepts in Statistics for examples, and the sketch below.
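For instance, a residual plot for the Old Faithful sample can be drawn as follows (a minimal sketch; numpy and matplotlib assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

eruptions = np.array([3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80])
waiting = np.array([84, 78, 47, 83, 52, 62, 84, 52, 79, 51], dtype=float)

m, b = np.polyfit(eruptions, waiting, 1)
residuals = waiting - (m * eruptions + b)

plt.scatter(eruptions, residuals)
plt.axhline(0, linestyle="--")  # reference line at residual = 0
plt.xlabel("eruptions (min)")
plt.ylabel("residual")
plt.show()                      # no obvious pattern suggests a linear model fits
```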
The residual standard error (or standard error of the regression), calculated by the Excel function STEYX(), is
$$s_e=\sqrt{\dfrac{SS_{res}}{n-2}},$$
where \(SS_{res}=\sum (y-\hat{y})^2\) is the sum of squared residuals.
The standard error is a typical (average) amount by which an observation deviates from the least-squares line.
The smaller \(s_e\) is, the more accurate the prediction is.
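A small sketch of this formula in Python (numpy assumed; this mirrors what STEYX() computes):

```python
import numpy as np

def residual_standard_error(x, y):
    """s_e = sqrt(SS_res / (n - 2)), mirroring Excel's STEYX()."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m, b = np.polyfit(x, y, 1)               # least-squares line
    ss_res = np.sum((y - (m * x + b)) ** 2)  # sum of squared residuals
    return np.sqrt(ss_res / (len(x) - 2))    # n - 2 degrees of freedom

eruptions = [3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80]
waiting = [84, 78, 47, 83, 52, 62, 84, 52, 79, 51]
print(residual_standard_error(eruptions, waiting))
```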
The coefficient of determination measures the proportion of variability in the response variable \(y\) that can be attributed to the linear regression line (a nice explanation can be found in Explaining The Variance of a Regression Model).
The total variation of \(y\) is the sum of squared deviations \(SS_{tot}=\sum(y-\bar{y})^2=(n-1)s_y^2\).
The variation of the predicted values is \(SS_{reg}=\sum(\hat{y}-\bar{y})^2=(n-1)r^2s_y^2\).
The coefficient of determination is $$\dfrac{SS_{reg}}{SS_{tot}}=\dfrac{(n-1)r^2s_y^2}{(n-1)s_y^2}=r^2.$$
A visualization by Magnusson, K. can be found at https://rpsychologist.com/correlation/.
The \(r\) in the coefficient of determination is the correlation coefficient. Equivalently, \(r=\pm\sqrt{r^2}\).
The smaller the standard error, the larger the coefficient of determination: \(r^2=1-\dfrac{SS_{res}}{SS_{tot}}=1-\dfrac{(n-2)s_e^2}{SS_{tot}}\).
\(n-2\) is the degrees of freedom. We lose two degrees of freedom because both the slope and the \(y\)-intercept are estimated.
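These identities are easy to verify numerically; the sketch below (numpy assumed, using the Old Faithful sample) checks that \(SS_{reg}/SS_{tot}\) and \(1-SS_{res}/SS_{tot}\) both equal \(r^2\):

```python
import numpy as np

eruptions = np.array([3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80])
waiting = np.array([84, 78, 47, 83, 52, 62, 84, 52, 79, 51], dtype=float)

m, b = np.polyfit(eruptions, waiting, 1)
y_hat = m * eruptions + b

ss_tot = np.sum((waiting - waiting.mean()) ** 2)  # total variation
ss_reg = np.sum((y_hat - waiting.mean()) ** 2)    # variation explained by the line
ss_res = np.sum((waiting - y_hat) ** 2)           # residual variation

r = np.corrcoef(eruptions, waiting)[0, 1]
print(ss_reg / ss_tot, r ** 2)      # these agree
print(1 - ss_res / ss_tot, r ** 2)  # and so do these
```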
Find the standard error and the coefficient of determination for the Midterm 1 and Final data.
Solution:
In Excel, the function STEYX() can be used to obtain the residual standard error. In this example, \(s_e\approx 2.748\).
The correlation coefficient is \(0.905\).
The coefficient of determination is \(r^2=0.905^2\approx 0.819\).
Midterm1 | Final |
---|---|
72 | 72 |
93 | 88 |
81 | 82 |
82 | 82 |
94 | 88 |
80 | 77 |
73 | 78 |
71 | 77 |
81 | 76 |
81 | 76 |
63 | 68 |
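For a check of this example outside Excel, the following Python sketch (numpy assumed) reproduces both numbers:

```python
import numpy as np

midterm = np.array([72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63], dtype=float)
final = np.array([72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68], dtype=float)

m, b = np.polyfit(midterm, final, 1)
residuals = final - (m * midterm + b)

s_e = np.sqrt(np.sum(residuals ** 2) / (len(midterm) - 2))  # STEYX() equivalent
r = np.corrcoef(midterm, final)[0, 1]

print(s_e)     # about 2.748
print(r ** 2)  # about 0.819
```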
Magnusson, K. (2023). Interpreting Correlations: An interactive visualization (Version 0.7.1) [Web App]. R Psychologist. https://rpsychologist.com/correlation/
PennState STAT 414, Lesson 18: The Correlation Coefficient. https://online.stat.psu.edu/stat414/lesson/18
Wikipedia: Correlation. https://en.wikipedia.org/wiki/Correlation
Wikipedia: Covariance. https://en.wikipedia.org/wiki/Covariance#Discrete_random_variables
Rodgers, Joseph Lee; Nicewander, W. Alan. Thirteen Ways to Look at the Correlation Coefficient. https://www.stat.berkeley.edu/~rabbee/correlation.pdf
Gunn, Matthew (https://stats.stackexchange.com/users/97925/matthew-gunn). How can the regression error term ever be correlated with the explanatory variables? Cross Validated.
Lab Instruction in Excel
To create a scatter plot, first select the data sets, and then choose Insert Scatter(X, Y) from the menu Insert -> Charts.
The correlation coefficient \(r\) can be calculated by the Excel function CORREL().
The slope of a linear regression can be calculated by the Excel function SLOPE().
The \(y\)-intercept of a linear regression can be calculated by the Excel function INTERCEPT().
The coefficient of determination \(r^2\) can be calculated by first finding \(r\) with CORREL() and then squaring it.
The standard error \(s_e\) of the regression (residual standard error) can be calculated by the Excel function STEYX().