Summarize and interpret the relationship between two quantitative variables.
Demonstrate understanding of concepts pertaining to linear regression.
Use regression equations to make predictions and understand their limits.
Correlation refers to a relationship between two quantitative variables:
the independent (or explanatory) variable, usually denoted by \(x\).
the dependent (or response) variable, usually denoted by \(y\).
Example: In a study of educational attainment and annual salary, the years of education is the explanatory variable and the annual salary is the response variable.
To describe the relationship between two quantitative variables, statisticians use a scatterplot.
In a scatterplot, we describe the overall pattern with descriptions of direction, form, and strength.
The strength of the relationship is a description of how closely the data follow the form of the relationship.
Outliers are points that deviate from the pattern of the relationship.
A: X = month (January = 1), Y = rainfall (inches) in Napa, CA in 2010 (Note: Napa has rain in the winter months and months with little to no rainfall in summer.)
B: X = month (January = 1), Y = average temperature in Boston, MA in 2010 (Note: Boston has cold winters and hot summers.)
C: X = year (in five-year increments from 1970), Y = Medicare costs (in $) (Note: the yearly increase in Medicare costs has gotten bigger and bigger over time.)
D: X = average temperature in Boston, MA (°F), Y = average temperature in Boston, MA (°C) each month in 2010
E: X = chest girth (cm), Y = shoulder girth (cm) for a sample of men
F: X = engine displacement (liters), Y = city miles per gallon for a sample of cars (Note: engine displacement is roughly a measure of engine size. Large engines use more gas.)
The correlation coefficient \(r\) is a numeric measure of the strength and direction of a linear relationship between two quantitative variables. One definition is the mean product of standardized values: $$r=\dfrac{\sum(x-\bar{x})(y-\bar{y})}{\sqrt{\sum(x-\bar{x})^2}\cdot\sqrt{\sum(y-\bar{y})^2}}=\dfrac{\sum\left(\frac{x-\bar{x}}{s_x}\right)\left(\frac{y-\bar{y}}{s_y}\right)}{n-1},$$ where \(n\) is the sample size, \(x\) is a data value for the explanatory variable, \(\bar{x}\) is the mean of the \(x\)-values, \(s_x\) is the standard deviation of the \(x\)-values, and similarly for the notations involving \(y\).
See the paper Thirteen Ways to Look at the Correlation Coefficient for other definitions.
The expression \(z=\dfrac{x-\bar{x}}{s_x}\) is known as the standardized variable (or \(z\)-score), which
doesn’t depend on the unit of the variable \(x\),
has mean \(0\) and standard deviation \(1\).
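To make the definition concrete, here is a minimal Python sketch (an illustration alongside the Excel workflow used in this course; numpy is assumed to be installed) that computes \(r\) as the mean product of \(z\)-scores and checks it against numpy's built-in correlation:

```python
import numpy as np

def correlation(x, y):
    """r as the mean product of standardized values, with an n - 1 denominator."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    z_x = (x - x.mean()) / x.std(ddof=1)  # ddof=1 gives the sample standard deviation s_x
    z_y = (y - y.mean()) / y.std(ddof=1)
    return np.sum(z_x * z_y) / (len(x) - 1)

# Midterm 1 and Final scores from the worked example later in this section
x = [72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63]
y = [72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68]
print(correlation(x, y))        # about 0.905
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in value agrees
```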
In Excel, the correlation coefficient can be calculated using the function CORREL().
Rounding Rule: Round to the nearest thousandth for \(r\), \(m\) and \(b\).
For the scatterplots in the previous slides, we see that
\(r>0\) if all points \((x-\bar{x}, y-\bar{y})\) are in the 1st and the 3rd quadrants.
\(r<0\) if all points \((x-\bar{x}, y-\bar{y})\) are in the 2nd and the 4th quadrants.
\(|r|\) is larger when the points lie closer to a line.
One motivation for using products of deviations comes from the geometric interpretation of the dot product, \(\mathbf{u}\cdot\mathbf{v}=\lVert \mathbf{u}\rVert\lVert\mathbf{v}\rVert\cos\theta\). There are also other interpretations of \(r\).
The correlation coefficient \(r\) is between \(-1\) and \(1\).
The closer the absolute value \(|r|\) is to \(1\), the stronger the linear relationship is. Conventionally, the relationship is strong if \(|r| > 0.8\), moderate if \(0.5< |r|\le 0.8\), and weak if \(|r|\le 0.5\).
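These conventional cutoffs are easy to encode; here is a tiny Python helper (an illustration, not a standard library function):

```python
def describe_strength(r):
    """Classify |r| using the conventional cutoffs given above."""
    a = abs(r)
    if a > 0.8:
        return "strong"
    elif a > 0.5:
        return "moderate"
    else:
        return "weak"

print(describe_strength(0.905))  # strong
```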
The correlation is symmetric in \(x\) and \(y\); that is, CORREL(x, y) = CORREL(y, x).
The correlation does not change when the units of measurement of either variable change. In other words, rescaling the explanatory and/or response variable has no effect on the correlation \(r\).
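A quick way to convince yourself of this unit invariance is to rescale one variable and recompute \(r\). The sketch below (made-up temperature data, numpy assumed) converts °F to °C and gets the same correlation:

```python
import numpy as np

x_f = np.array([30.0, 42.0, 55.0, 61.0, 70.0])  # made-up temperatures in °F
y = np.array([2.1, 3.0, 3.9, 4.4, 5.2])         # made-up response values

x_c = (x_f - 32) * 5 / 9                        # the same temperatures in °C

print(np.corrcoef(x_f, y)[0, 1])  # r using °F
print(np.corrcoef(x_c, y)[0, 1])  # identical r using °C (up to floating-point rounding)
```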
After discussing regression lines, you will see why \(r^2\le 1\) and why \(r\) measures the strength of a linear relationship; you may also read 18.4 - More on Understanding Rho.
The correlation by itself is not enough to determine whether a relationship is linear. It’s important to graph a data set before analyzing it. See Francis Anscombe’s demonstration of both the importance of graphing data and the effect of outliers on statistical properties.
The correlation is heavily influenced by outliers. Try the simulation in Linear Relation (4 of 4) in Concepts in Statistics.
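In the same spirit as that simulation, the following sketch (simulated data, numpy assumed) shows how a single extreme point can sharply change \(r\):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(0, 1, 30)  # a strong linear relationship

print(np.corrcoef(x, y)[0, 1])    # close to 1

# append one extreme outlier far from the pattern
x_out = np.append(x, 30.0)
y_out = np.append(y, -20.0)
print(np.corrcoef(x_out, y_out)[0, 1])  # much weaker; one point can even flip the sign
```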
Describe the relationship between Midterm 1 and Final for a sample of 11 students, with the data shown below.
Solution: First we create a scatterplot.
Using the Excel function CORREL(x, y), we find the correlation coefficient is \(r=0.905\).
The \(r\)-value shows a strong positive linear relationship.
Midterm1 | Final |
---|---|
72 | 72 |
93 | 88 |
81 | 82 |
82 | 82 |
94 | 88 |
80 | 77 |
73 | 78 |
71 | 77 |
81 | 76 |
81 | 76 |
63 | 68 |
The correlation coefficient \(r\) can also be calculated by hand using the formula \(r=\dfrac{\sum z_xz_y}{n-1}\), where \(z_x=\frac{x-\bar{x}}{s_x}\) and \(z_y=\frac{y-\bar{y}}{s_y}\).
Midterm1 | Final | z_x | z_y | z_x z_y |
---|---|---|---|---|
72 | 72 | -0.78006 | -1.06926 | 0.834087814 |
93 | 88 | 1.50088 | 1.544483 | 2.318083715 |
81 | 82 | 0.197484 | 0.56433 | 0.111446332 |
82 | 82 | 0.306101 | 0.56433 | 0.172741815 |
94 | 88 | 1.609497 | 1.544483 | 2.485839773 |
80 | 77 | 0.088868 | -0.25246 | -0.02243591 |
73 | 78 | -0.67145 | -0.0891 | 0.059829084 |
71 | 77 | -0.88868 | -0.25246 | 0.224359064 |
81 | 76 | 0.197484 | -0.41582 | -0.08211835 |
81 | 76 | 0.197484 | -0.41582 | -0.08211835 |
63 | 68 | -1.75761 | -1.72269 | 3.027820885 |
79.18182 | 78.54545 | <- mean | sum -> | 9.047535876 |
9.206717 | 6.121497 | <- stdev.s | correl -> | 0.904753588 |
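As a sanity check, the hand computation in the table can be reproduced in a few lines of Python (numpy assumed; Excel's STDEV.S corresponds to ddof=1):

```python
import numpy as np

midterm = np.array([72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63], dtype=float)
final = np.array([72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68], dtype=float)

z_x = (midterm - midterm.mean()) / midterm.std(ddof=1)
z_y = (final - final.mean()) / final.std(ddof=1)

print(midterm.mean(), final.mean())            # 79.18..., 78.54...
print(midterm.std(ddof=1), final.std(ddof=1))  # 9.206..., 6.121...
print(np.sum(z_x * z_y))                       # about 9.0475
print(np.sum(z_x * z_y) / (len(midterm) - 1))  # r, about 0.9048
```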
Correlation describes data from an observational study. Observational studies cannot prove cause and effect, which requires a controlled study and rigorous inference.
Correlation may be used to make a prediction, which is probabilistic.
An \(r\)-value close to \(1\) or \(-1\) is insufficient to claim that the explanatory variable causes changes in the response variable. The correct interpretation is that there is a statistical relationship between the variables.
A lurking variable is a variable that is not measured in the study, but affects the interpretation of the relationship between the explanatory and response variables.
The scatterplot below shows the relationship between the number of firefighters sent to fires (x) and the amount of damage caused by fires (y) in a certain city.
Can we conclude that the increase in firefighters causes the increase in damage?
Solution:
Correlation: The more firefighters sent, the greater the damage tends to be. However, the firefighters do not cause the fire.
Prediction: You could predict the amount of damage from the number of firefighters present.
Causation: The firefighters are unlikely to be the cause of the fire.
Lurking variable: The seriousness of the fire is a lurking variable.
The following sample is taken from data about the Old Faithful geyser. Find an equation of the regression line and use it to predict the waiting time when an eruption lasts \(1.8\) minutes.
eruptions (min) | waiting (min) | eruptions (min) | waiting (min) |
---|---|---|---|
3.917 | 84 | 1.75 | 62 |
4.200 | 78 | 4.80 | 84 |
1.750 | 47 | 1.60 | 52 |
4.700 | 83 | 4.25 | 79 |
2.167 | 52 | 1.80 | 51 |
Solution: The scatterplot shows a linear relationship.
The slope of the regression line can be obtained using the Excel function SLOPE(). In this example, \(m=10.836\).
The \(y\)-intercept \((0,b)\) can be obtained using the Excel function INTERCEPT(). In this example, \(b=33.68\).
An equation of the line is then \(\hat{y}=10.836x + 33.68\).
When \(x=1.8\), we have \(\hat{y}=10.836\times 1.8 + 33.68 = 53.1848\).
The residual is \(y-\hat{y}=51-53.1848=-2.1848\). That means the prediction overestimates the waiting time by about 2.18 minutes.
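The same slope, intercept, prediction, and residual can be checked outside Excel; here is a short Python sketch (numpy assumed) using the sample from the table above:

```python
import numpy as np

# Old Faithful sample from the table above
eruptions = np.array([3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80])
waiting = np.array([84, 78, 47, 83, 52, 62, 84, 52, 79, 51], dtype=float)

# least-squares fit; np.polyfit plays the role of Excel's SLOPE() and INTERCEPT()
m, b = np.polyfit(eruptions, waiting, 1)
print(m, b)          # about 10.836 and 33.68

y_hat = m * 1.8 + b  # predicted waiting time for a 1.8-minute eruption
print(y_hat)         # about 53.18
print(51 - y_hat)    # residual y - y_hat, about -2.18
```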
A residual is an observable estimate of the unobservable statistical error.
Positive and Negative Residuals
A residual plot, a scatterplot of the points \((x, \text{residual})\), can be used to assess whether a linear model is appropriate. A random pattern (or no obvious pattern) indicates a good fit of a linear model. See Assessing the Fit of a Line (2 of 4) in Concepts in Statistics for examples, and the sketch below.
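For instance, a residual plot for the Old Faithful sample can be drawn as follows (a minimal sketch; numpy and matplotlib assumed):

```python
import numpy as np
import matplotlib.pyplot as plt

eruptions = np.array([3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80])
waiting = np.array([84, 78, 47, 83, 52, 62, 84, 52, 79, 51], dtype=float)

m, b = np.polyfit(eruptions, waiting, 1)
residuals = waiting - (m * eruptions + b)

plt.scatter(eruptions, residuals)
plt.axhline(0, linestyle="--")  # reference line at residual = 0
plt.xlabel("eruptions (min)")
plt.ylabel("residual")
plt.show()                      # no obvious pattern suggests a linear model fits
```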
The residual standard error (or standard error of the regression), calculated by the Excel function STEYX(), is
$$s_e=\sqrt{\dfrac{SS_{res}}{n-2}},$$
where \(SS_{res}=\sum (y-\hat{y})^2\) is the sum of squared residuals.
The standard error is a typical (average) amount by which an observation deviates from the least-squares line.
The smaller \(s_e\) is, the more accurate the prediction is.
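A small sketch of this formula in Python (numpy assumed; this mirrors what STEYX() computes):

```python
import numpy as np

def residual_standard_error(x, y):
    """s_e = sqrt(SS_res / (n - 2)), mirroring Excel's STEYX()."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    m, b = np.polyfit(x, y, 1)               # least-squares line
    ss_res = np.sum((y - (m * x + b)) ** 2)  # sum of squared residuals
    return np.sqrt(ss_res / (len(x) - 2))    # n - 2 degrees of freedom

eruptions = [3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80]
waiting = [84, 78, 47, 83, 52, 62, 84, 52, 79, 51]
print(residual_standard_error(eruptions, waiting))
```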
The coefficient of determination measures the proportion of variability in the response variable \(y\) that can be attributed to the linear regression line (a nice explanation can be found in Explaining The Variance of a Regression Model).
The total variation of \(y\) is the sum of squared deviations \(SS_{tot}=\sum(y-\bar{y})^2=(n-1)s_y^2\).
The variation of the predicted values is \(SS_{reg}=\sum(\hat{y}-\bar{y})^2=(n-1)r^2s_y^2\).
The coefficient of determination is $$\dfrac{SS_{reg}}{SS_{tot}}=\dfrac{(n-1)r^2s_y^2}{(n-1)s_y^2}=r^2.$$
A visualization by Magnusson, K. can be found at https://rpsychologist.com/correlation/.
The \(r\) in the coefficient of determination is the correlation coefficient. Equivalently, \(r=\pm\sqrt{r^2}\).
The smaller the standard error, the larger the coefficient of determination: \(r^2=1-\dfrac{SS_{res}}{SS_{tot}}=1-\dfrac{(n-2)s_e^2}{SS_{tot}}\).
\(n-2\) is the degrees of freedom. We lose two degrees of freedom because both the slope and the \(y\)-intercept are estimated.
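These identities are easy to verify numerically; the sketch below (numpy assumed, using the Old Faithful sample) checks that \(SS_{reg}/SS_{tot}\) and \(1-SS_{res}/SS_{tot}\) both equal \(r^2\):

```python
import numpy as np

eruptions = np.array([3.917, 4.200, 1.750, 4.700, 2.167, 1.75, 4.80, 1.60, 4.25, 1.80])
waiting = np.array([84, 78, 47, 83, 52, 62, 84, 52, 79, 51], dtype=float)

m, b = np.polyfit(eruptions, waiting, 1)
y_hat = m * eruptions + b

ss_tot = np.sum((waiting - waiting.mean()) ** 2)  # total variation
ss_reg = np.sum((y_hat - waiting.mean()) ** 2)    # variation explained by the line
ss_res = np.sum((waiting - y_hat) ** 2)           # residual variation

r = np.corrcoef(eruptions, waiting)[0, 1]
print(ss_reg / ss_tot, r ** 2)      # these agree
print(1 - ss_res / ss_tot, r ** 2)  # and so do these
```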
Find the standard error and the coefficient of determination for the Midterm 1 and Final data.
Solution:
In Excel, the function STEYX() can be used to obtain the residual standard error. In this example, \(s_e\approx 2.748\).
The correlation coefficient is \(0.905\).
The coefficient of determination is \(r^2=0.905^2\approx 0.819\).
Midterm1 | Final |
---|---|
72 | 72 |
93 | 88 |
81 | 82 |
82 | 82 |
94 | 88 |
80 | 77 |
73 | 78 |
71 | 77 |
81 | 76 |
81 | 76 |
63 | 68 |
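For a check of this example outside Excel, the following Python sketch (numpy assumed) reproduces both numbers:

```python
import numpy as np

midterm = np.array([72, 93, 81, 82, 94, 80, 73, 71, 81, 81, 63], dtype=float)
final = np.array([72, 88, 82, 82, 88, 77, 78, 77, 76, 76, 68], dtype=float)

m, b = np.polyfit(midterm, final, 1)
residuals = final - (m * midterm + b)

s_e = np.sqrt(np.sum(residuals ** 2) / (len(midterm) - 2))  # STEYX() equivalent
r = np.corrcoef(midterm, final)[0, 1]

print(s_e)     # about 2.748
print(r ** 2)  # about 0.819
```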
Magnusson, K. (2023). Interpreting Correlations: An interactive visualization (Version 0.7.1) [Web App]. R Psychologist. https://rpsychologist.com/correlation/
PennState STAT 414, Lesson 18: The Correlation Coefficient. https://online.stat.psu.edu/stat414/lesson/18
Wikipedia: Correlation. https://en.wikipedia.org/wiki/Correlation
Wikipedia: Covariance. https://en.wikipedia.org/wiki/Covariance#Discrete_random_variables
Rodgers, Joseph Lee; Nicewander, W. Alan. Thirteen Ways to Look at the Correlation Coefficient. https://www.stat.berkeley.edu/~rabbee/correlation.pdf
Gunn, Matthew (https://stats.stackexchange.com/users/97925/matthew-gunn). How can the regression error term ever be correlated with the explanatory variables? Cross Validated.
Lab Instruction in Excel
To create a scatter plot, first select the data sets, and then choose Insert Scatter(X, Y) from the menu Insert -> Charts.
The correlation coefficient \(r\) can be calculated by the Excel function CORREL().
The slope of a linear regression can be calculated by the Excel function SLOPE().
The \(y\)-intercept of a linear regression can be calculated by the Excel function INTERCEPT().
The coefficient of determination \(r^2\) can be calculated by first finding \(r\) with CORREL() and then squaring it.
The standard error \(s_e\) of the regression (residual standard error) can be calculated by the Excel function STEYX().