Title: Correlation and Regression
1 Correlation and Regression
Slides by Brad Evanoff, MD, MPH
Talk by Brian Gage, MD, MSc
2 Overview of Correlation and Regression
- Correlation seeks to establish whether a relationship exists between two variables
- Regression seeks to use one variable to predict another variable
- Both measure the extent of a linear relationship between two variables
- Statistical tests are used to determine the strength of the relationship
3 Nondependent and Dependent Relationships
- Types of Relationship
  - Nondependent (correlation) -- neither variable is the target. Example: protein and fat intake
  - Dependent (regression) -- the value of one variable is used to predict the value of another. Example: ACT and MCAT scores for medical school applicants; MCAT is the dependent variable and ACT is the independent variable
- Statistical Expressions
  - Correlation Coefficient -- index of a nondependent relationship
  - Regression Coefficient -- index of a dependent relationship
4 Example
- Measure the daily fecal lipid and fecal energy for 20 children with cystic fibrosis
- Plot each individual as a point on a graph which has fecal lipid on one axis and fecal energy on the other axis
- What does the distribution of these values look like?
7 Pearson's Product Moment Correlation Coefficient
- The correlation coefficient, r, is a measure of the interdependent relationship between two continuous variables
- For two variables, x and y, the correlation coefficient measures the extent to which greater values of x are associated with greater values of y
8
- The value of r can range from -1 to 1
- Absolute values close to 1, with either sign, will represent a close correlation
- Values close to 0 will represent little or no correlation
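The definition of r above can be made concrete with a minimal sketch. The function name `pearson_r` and the data values are illustrative assumptions, not taken from the slides; the formula is the standard product-moment one (sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations).

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values (not the cystic fibrosis data).
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # perfectly linear, increasing
falling = [10, 8, 6, 4, 2]   # perfectly linear, decreasing

print(pearson_r(x, rising))   # at the upper end of the range
print(pearson_r(x, falling))  # at the lower end of the range
```

With these made-up values the two calls land at the two ends of the range of r, one for a perfectly increasing line and one for a perfectly decreasing line.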
9-14 r = ? (six scatterplot slides; figures not reproduced)
15 Importance of Scatterplots and Examining the Data
- Scatterplot F shows the relationship between temperature and number of nerve fiber discharges
- The scatterplot demonstrates a strong relationship
- However, the correlation coefficient, which only measures a linear relationship, has a value of zero (note that scatterplot E also has an r value of zero, but there clearly no relationship exists between the two variables)
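A small sketch of the same point: a strong but non-linear relationship can give r = 0. The data here are made up (a symmetric curve, y = x squared), not the temperature and nerve fiber data from the scatterplot, and `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: y is completely determined by x, but the curve is symmetric,
# so the rising and falling halves cancel in the linear measure.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(r)
```

Here y can be predicted exactly from x, yet the linear correlation is zero, which is why the scatterplot has to be examined alongside the coefficient.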
16
- r values can be tested to see if an observed correlation is statistically significant
- The same distinction between magnitude of effect and statistical significance must be made as for other tests: a large sample may make small correlations statistically significant yet clinically meaningless
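One common way to test r, sketched here as an assumption since the slides do not name the test, is the t statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The function name is illustrative; the second example (r = 0.05, n = 10,000) is hypothetical.

```python
import math

def t_statistic_for_r(r, n):
    """t statistic for testing whether the true correlation is zero (df = n - 2)."""
    if n < 3:
        raise ValueError("need at least 3 pairs of observations")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

# r = 0.42 with n = 20 matches the cystic fibrosis example in these slides.
t_small_sample = t_statistic_for_r(0.42, 20)

# Hypothetical: a very weak correlation observed in a very large sample.
t_large_sample = t_statistic_for_r(0.05, 10_000)

print(round(t_small_sample, 2), round(t_large_sample, 2))
```

For reference, the two-sided 5% critical value of t is about 2.10 with 18 degrees of freedom and about 1.96 for very large samples. Under these assumptions the moderate correlation in 20 children falls just short of significance, while the hypothetical r of 0.05 clears it easily even though it accounts for only a quarter of one percent of the variation.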
20 Coefficient of Determination, r²
- To understand the strength of the relationship between two variables
- The correlation coefficient, r, is squared
- r² shows how much of the variation in one measure (say, fecal energy) is accounted for by knowing the value of the other measure (fecal lipid loss)
21
- For the cystic fibrosis patients, r = 0.42 and r² = 0.18
- 18% of the variation in fecal energy may be accounted for by knowing fecal lipid loss (or vice versa)
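The arithmetic behind these figures, as a short worked example. The only input is the r value of 0.42 reported for the cystic fibrosis patients; the variable names are illustrative.

```python
r = 0.42                   # correlation reported for the cystic fibrosis patients
r_squared = r ** 2         # coefficient of determination
unexplained = 1 - r_squared

print(f"r^2 = {r_squared:.2f}")
print(f"accounted for: {r_squared:.0%}, not accounted for: {unexplained:.0%}")
```

Squaring shrinks moderate correlations considerably: a correlation of 0.42 leaves roughly four fifths of the variation unaccounted for.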
23 Caveats
- Correlation does not imply causation
- Correlation measures only linear association, and many biological systems are better described by curvilinear plots
- This is one reason why data should always be looked at first (scatterplot)
24
- The correlation coefficient assumes normally distributed data
- The correlation coefficient is sensitive to extreme values
- Non-normal distributions can be transformed (e.g., logarithmic transformation), or the data can be converted into ranks and a non-parametric correlation test used (Spearman's rank correlation)
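A minimal sketch of the rank-based approach: convert each variable to ranks (ties share the average rank), then compute the ordinary Pearson coefficient on the ranks. The helper names and the data, which include one deliberately extreme value, are illustrative assumptions.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up data: y always increases with x, but the last value is extreme.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 6, 8, 100]

print(round(pearson_r(x, y), 2))     # pulled down by the extreme value
print(round(spearman_rho(x, y), 2))  # depends only on the ordering
```

Because ranks ignore how far apart the values are, the single extreme point lowers Pearson's r but leaves the rank correlation unaffected in this example.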
25 Types of Coefficients
Type of Data               Correlation Coefficient
Continuous v. Continuous   Pearson's r
Continuous v. Ordinal      Jaspen's Multiserial Coefficient (M)
Ordinal v. Ordinal         Spearman's ρ (rho), Kendall's τ (tau)
26 Linear Regression
- Used when the goal is to predict the value of one characteristic from knowledge of another
- Assumes a straight-line, or linear, relationship between two variables
- But the variables can be transformed first
- When the term "simple" is used with regression, it refers to the situation where one explanatory (independent) variable is used to predict another
- Multiple regression is used for more than one explanatory variable
27
- If the point at which the line intercepts or crosses the Y-axis is denoted β0 and the slope of the line is denoted β1, then Y = β1X + β0
- Like y = mx + b
- The slope is a measure of how much Y changes for a one-unit change in X
29
- Because the points rarely fall along a perfect straight line, there is also an error term e
- The formula then becomes Y = β1X + β0 + e
- The error term is a measure of the amount by which the actual Y values depart from the Y values predicted by the equation
- Regression lines are fitted using a method called least squares, which finds the line that minimizes the sum of the squared errors
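The least squares fit for a single predictor has a closed form: slope = Sxy / Sxx, and intercept = mean(Y) - slope * mean(X). The sketch below uses made-up data and illustrative function names, and checks that moving the slope or intercept away from the fitted values increases the sum of squared errors.

```python
def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for the line y = b1*x + b0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

def sum_squared_errors(x, y, b0, b1):
    """Sum of squared differences between observed y and the line's predictions."""
    return sum((yi - (b1 * xi + b0)) ** 2 for xi, yi in zip(x, y))

# Made-up data scattered around a straight line.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
best = sum_squared_errors(x, y, b0, b1)

# Any other line does worse on the least-squares criterion.
steeper = sum_squared_errors(x, y, b0, b1 + 0.1)
shifted = sum_squared_errors(x, y, b0 + 0.1, b1)

print(round(b0, 3), round(b1, 3))
print(best < steeper, best < shifted)
```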
31 Example
- Investigators want to be able to predict a potential medical school applicant's MCAT score from his or her previous ACT examination score
- Create a scatterplot of ACT and MCAT test scores
- Calculate the regression equation for ACT scores and MCAT scores
32 r = ?
33
Y = -1.61 + 0.406X, where Y is the predicted MCAT score and X is the ACT score
R = 0.62
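A short sketch of how the fitted equation would be used for prediction, reading it as intercept -1.61 plus slope 0.406 times the ACT score (the plus sign is an assumption, since the operator is not legible in the transcript). The function name and the ACT scores in the loop are hypothetical.

```python
def predict_mcat(act_score):
    """Predicted MCAT score, assuming the equation Y = -1.61 + 0.406 * X."""
    intercept = -1.61
    slope = 0.406
    return intercept + slope * act_score

# Hypothetical ACT scores, chosen only to show the arithmetic.
for act in (20, 25, 30):
    print(act, round(predict_mcat(act), 2))

# The slope is the change in predicted MCAT for a one-point change in ACT.
print(round(predict_mcat(26) - predict_mcat(25), 3))
```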
34
- This model of simple linear regression can be extended to situations where there is more than one independent variable of interest
- The equation below shows a model which predicts Y based on three independent variables, X1, X2, and X3
Y = β0 + β1X1 + β2X2 + β3X3
35 Multiple Regression
- Just like simple linear regression, but with more variables
- Allows the independent effects of several variables to be studied at once; can examine the contribution of any variable while controlling for the effects of other variables
- Useful when the predictor (independent) variables and the outcome (dependent) variable are numerical (continuous), e.g., weight, age, Hct
36 Multiple Regression
- Y = β0 + β1X1 + β2X2 + ... + βkXk
- Y = estimated value for the dependent (outcome) variable
- β0 = intercept
- βi = partial regression coefficients; indicate how much Y changes for each unit of change in Xi when all other variables in the model are held constant
- Xi = independent (predictor) variables
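A minimal sketch of fitting such a model by least squares, solving the normal equations with plain Gaussian elimination. This is one textbook approach, not necessarily what any statistics package does internally; the function names and the data are illustrative, and the data are generated exactly from y = 1 + 2*x1 + 3*x2 so that the fit can be checked.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_multiple_regression(rows, y):
    """Return [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 is the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Made-up data generated without noise from y = 1 + 2*x1 + 3*x2.
rows = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5), (6, 4)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

b0, b1, b2 = fit_multiple_regression(rows, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Each fitted coefficient is the change in predicted Y for a one-unit change in its own X with the other X held fixed, which is the "partial" interpretation given above.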
37 Multiple regression R
- Multiple R = correlation coefficient; indicates correspondence between the Y values predicted by the model and the Y values observed
- R² = amount of variability in Y explained by variation in the X variables contained in the model
- Model calculates partial R values (correlation coefficients of individual variables) as well as R for the whole model
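Both quantities can be computed directly from observed and predicted Y values. For brevity this sketch uses a one-predictor least-squares model and made-up data, which are assumptions for illustration; the same two definitions apply however many X variables the model contains.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)

# Made-up data.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]

# R: correlation between the observed Y values and the model's predictions.
multiple_r = pearson_r(y, predicted)

# R^2: share of the variability in Y explained by the model.
mean_y = sum(y) / len(y)
sse = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # unexplained
sst = sum((yi - mean_y) ** 2 for yi in y)                  # total
r_squared = 1 - sse / sst

print(round(multiple_r, 4), round(r_squared, 4))
```

For a least-squares model that includes an intercept, squaring R computed the first way gives the same number as R² computed the second way.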
38 Results of Stepwise Regression Predicting Resident Performance
39 Building A Multiple Regression Model
- Usual case: picking a few significant variables from many candidate variables
- Variables can be included because of clinical significance (forced into the model) or because of statistical significance
- Statistical significance is usually determined by a stepwise process
40 Forward Selection
- Picks the X variable with the highest R and puts it in the model
- Then looks for the X variable which will increase R² by the highest amount
- Test for statistical significance is performed (using the F test)
- If statistically significant, the new variable is included in the model, and the variable with the next highest R² is tested
- The selection stops when no variable can be added which significantly increases R²
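The steps above can be sketched as follows. Several pieces are assumptions made for illustration: the data are made up, the function names are hypothetical, and instead of a p-value from the F distribution the sketch compares the partial F statistic with a fixed "F to enter" cutoff of 4.0. Statistical packages typically use a p-value threshold instead.

```python
def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R^2 of the least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst

def forward_selection(candidates, y, f_to_enter=4.0):
    """Return predictor names in order of entry. f_to_enter is an assumed cutoff."""
    n = len(y)
    selected, remaining, current_r2 = [], list(candidates), 0.0
    while remaining:
        # Try each remaining variable; keep the one that raises R^2 the most.
        trial_r2 = {name: r_squared([candidates[s] for s in selected + [name]], y)
                    for name in remaining}
        best = max(trial_r2, key=trial_r2.get)
        new_r2 = trial_r2[best]
        df_resid = n - len(selected) - 2  # n - (predictors in new model) - 1
        if df_resid <= 0:
            break
        unexplained = 1.0 - new_r2
        f_stat = (float("inf") if unexplained <= 0
                  else (new_r2 - current_r2) / (unexplained / df_resid))
        if f_stat < f_to_enter:
            break  # no remaining variable adds enough; stop
        selected.append(best)
        remaining.remove(best)
        current_r2 = new_r2
    return selected

# Made-up data: y depends on x1 and x2 plus small noise; x3 is unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
x2 = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
x3 = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1, 0.2, -0.2]
y = [2 * a + b + e for a, b, e in zip(x1, x2, noise)]

selected = forward_selection({"x1": x1, "x2": x2, "x3": x3}, y)
print(selected)
```

The sketch does not guard against a candidate that is an exact linear combination of variables already in the model; in that case the solver raises an error rather than skipping the variable.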
41 Backwards Elimination
- Starts with all variables in the model
- Removes the X variable which results in the smallest change in R²
- Continues to remove variables from the model until removal produces a statistically significant drop in R²
42 Stepwise regression
- Similar to forward selection, but after each new X is added to the model, all X variables already in the model are re-checked to see if the addition of the new variable has affected their significance
- Bizarre, but unfortunately true: running forward selection, backward elimination, and stepwise regression on the same data often gives different answers
43 Multiple Regression Caveats
- Try not to include predictor variables which are highly correlated with each other
- One X may force the other out, with strange results
- Overfitting: too many variables make for an unstable model
- Common rule of thumb: need > 10 subjects (or events) for each X variable
- Model assumes a normal distribution for the Y variable; widely skewed data may give misleading results
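Two of these caveats lend themselves to a quick screening check before fitting, sketched below. The 0.8 correlation cutoff, the function names, and the predictor values are all assumptions for illustration; the slides give only the "more than 10 subjects per X variable" rule.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def highly_correlated_pairs(predictors, threshold=0.8):
    """List predictor pairs whose |r| meets the (assumed) threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

def enough_subjects(n_subjects, n_predictors, per_variable=10):
    """Rule of thumb from the slides: more than 10 subjects (or events) per X."""
    return n_subjects > per_variable * n_predictors

# Hypothetical predictor values for six subjects.
predictors = {
    "weight_kg": [60, 72, 81, 55, 90, 68],
    "height_cm": [165, 175, 183, 160, 190, 172],
    "age_years": [34, 51, 28, 45, 39, 60],
}

flagged = highly_correlated_pairs(predictors)
print(flagged)
print(enough_subjects(n_subjects=6, n_predictors=3))
```

In this made-up example weight and height track each other closely, so only one of them would normally be offered to the model, and six subjects are far too few for three predictors.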